In our previous blog post, we discussed the importance of data cleansing and pre-processing in making sure that our machine learning methods work correctly. After applying the necessary cleansing techniques, we end up with a validated, structured data set free of aberrant data that might otherwise harm training. However, data cleansing alone does not guarantee that our machine learning methods will work well. This is where data transformation tools are essential, not only to improve the effectiveness of machine learning, but also to ensure that it can be performed successfully.
Data transformation techniques
A variety of data transformation techniques are available. Some apply to specific variables in our data set, while others apply across the data set as a whole.
- MinMax scaling: It is very common for the numerical values in our data to have very different ranges. Imagine that we have a data set related to the health of a group of people. One of the measurements that we are recording is the red blood cell count, which has an average value of about 4.5 million cells per microlitre. Another measurement is uric acid, which has a normal range of between 4.2 and 8 milligrams per decilitre (mg/dL). If we try to train a system on this data, the disparity between the red blood cell values and the uric acid values will make the latter practically negligible. MinMax scaling rescales every value into the range 0 to 1 while preserving the shape of the original distribution.
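As a minimal sketch, min-max scaling is just x' = (x − min) / (max − min). The uric acid readings below are invented for illustration:

```python
def minmax_scale(values):
    """Scale a list of numbers into the [0, 1] range.

    x' = (x - min) / (max - min)
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to 0 by convention.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical uric acid readings in mg/dL
uric_acid = [4.2, 5.0, 6.1, 8.0]
print(minmax_scale(uric_acid))  # smallest value -> 0.0, largest -> 1.0
```

After scaling both columns this way, red blood cell counts and uric acid levels contribute on the same 0-to-1 scale.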
- Standard scaling: At other times, our problem is that the range of some of the variables is greater than the rest. Continuing with the previous example, normal iron levels range between 65 and 170 micrograms per decilitre (µg/dL) while normal glucose levels range between 74 and 106 mg/dL. Numerically, both fields have a similar magnitude, but the range of possible values for iron is much wider than the range for glucose. This difference may have an impact on the machine learning model, as it would need to learn that changing a parameter related to the glucose variable does not have the same effect as changing, by the same amount, another parameter related to the iron variable. To avoid this and ensure that all variables have a similar distribution, standard scaling is used. This transforms the original data so that its statistical distribution has a mean of 0 and a standard deviation of 1.
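Standard scaling (often called z-score normalisation) can be sketched in a few lines of plain Python; the readings below are illustrative, not real patient data:

```python
import statistics

def standard_scale(values):
    """Transform values so the result has mean 0 and standard deviation 1.

    z = (x - mean) / std, using the population standard deviation.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

iron = [65, 100, 135, 170]   # hypothetical iron readings (µg/dL)
glucose = [74, 85, 95, 106]  # hypothetical glucose readings (mg/dL)
print(standard_scale(iron))
print(standard_scale(glucose))
```

After this transformation, a one-unit change means "one standard deviation" for every variable, regardless of its original range.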
- Data discretisation: It is often the case that some of the data we have will provide more information than we really need. Using the same set of clinical data as before, imagine this time that we are simply interested in knowing whether the patient has a good blood cholesterol level and that our data provides the amount of cholesterol in milligrams per decilitre. We may not be interested in the specific number but simply want to know whether the values are normal or not. Discretising the data involves converting a continuous variable into a finite set of values. In this case, three categories could be defined: “Healthy” when the value is below 200 mg/dL, “Warning” when it is between 200 and 239 mg/dL, and “Danger” when it is 240 mg/dL or above. By doing this, we reduce the complexity of the problem and can achieve better results in a much shorter time.
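Using the thresholds above, the discretisation can be sketched as a simple function (the readings passed in are hypothetical):

```python
def cholesterol_category(mg_dl):
    """Discretise a cholesterol reading (mg/dL) into three bands."""
    if mg_dl < 200:
        return "Healthy"
    if mg_dl < 240:
        return "Warning"
    return "Danger"

readings = [182, 225, 251]
print([cholesterol_category(r) for r in readings])
# -> ['Healthy', 'Warning', 'Danger']
```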
- Data encoding: As we said in our previous post on data cleansing, in order to work with machine learning models such as artificial neural networks, the data has to be numerical. However, the variables in our data are very often dates, categories and other text. For our machine learning model to be able to take this data into account, we have to perform transformation processes that convert all this non-numerical data into numerical data. There are many techniques for converting this type of data into numbers:
- Dates: Date fields can easily be converted into numbers using Unix time, which is the number of seconds elapsed since midnight UTC on 1 January 1970. There are other methods, however, such as encoding the date with sine and cosine functions so that its cyclical nature (day of the week, month of the year) is represented.
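Both ideas can be sketched together: the Unix timestamp plus a sine/cosine pair over the day of the year. The 365-day period used here is a simplifying assumption that ignores leap years:

```python
import math
from datetime import datetime, timezone

def encode_date(dt):
    """Encode a datetime as (Unix time, sin, cos) where the sine/cosine
    pair captures the cyclical position of the day within the year."""
    unix = dt.replace(tzinfo=timezone.utc).timestamp()
    day = dt.timetuple().tm_yday  # 1..366
    angle = 2 * math.pi * (day - 1) / 365.0
    return unix, math.sin(angle), math.cos(angle)

print(encode_date(datetime(1970, 1, 1)))  # -> (0.0, 0.0, 1.0)
```

The cyclical pair has the advantage that 31 December and 1 January end up close together, which raw Unix time does not capture.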
- Text: To encode text as numbers, you can use techniques such as Bag of Words, which counts how often each word appears in the text. There is a very important field of research related to text processing called Natural Language Processing, usually shortened to NLP. We will discuss this topic in future blog posts.
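A minimal Bag of Words can be built with a few lines of standard-library Python; the two example sentences are invented:

```python
from collections import Counter

def bag_of_words(documents):
    """Map each document to a vector of word counts over a shared,
    alphabetically sorted vocabulary."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    counts = [Counter(doc.lower().split()) for doc in documents]
    return vocab, [[c[word] for word in vocab] for c in counts]

docs = ["the patient is healthy", "the patient is not healthy"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['healthy', 'is', 'not', 'patient', 'the']
print(vectors)  # [[1, 1, 0, 1, 1], [1, 1, 1, 1, 1]]
```

Real projects would normally also handle punctuation, stop words and word frequencies, but the counting idea is the same.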
- Categories: To encode categories such as those we created after discretising the cholesterol data, each category can be converted into a numerical value (Healthy = 0; Warning = 1; Danger = 2). In this case, the encoding is meaningful because the order relationship between the categories is preserved: being in the Danger category is closer to Warning than to Healthy. However, there are many occasions when the categories have no logical order. For example, if we encode having a pet as Dog = 0, Cat = 1, Bird = 2 and Fish = 3, this would suggest that a fish is closer to being a bird than to being a dog, which makes no sense at all. To solve this, we use a technique called OneHotEncoding, which converts each category into a new Boolean variable in the data set. For example, if we have a record with the value Bird in the pet field and then apply OneHotEncoding, we end up with a bird field with value 1 (True) and three other pet fields with value 0 (False). This can significantly increase the number of variables in our data set, but it does allow us to process categorical fields correctly. If the number of variables grows too large, we can use dimensionality reduction techniques to describe each record using only the most relevant variables. We will discuss these techniques in future blog posts.
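A minimal sketch of one-hot encoding, without any library dependencies, using the pet values from the example above:

```python
def one_hot(values):
    """One-hot encode a categorical column: one Boolean field per category,
    with categories listed in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows

pets = ["Dog", "Cat", "Bird", "Fish", "Dog"]
categories, encoded = one_hot(pets)
print(categories)  # ['Bird', 'Cat', 'Dog', 'Fish']
print(encoded[2])  # the Bird record -> [1, 0, 0, 0]
```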
In our last two posts on Artificial Intelligence, we’ve outlined the steps we need to take when presented with a new data set:
- Search for empty fields.
- Validate that the values match the field type.
- Delete duplicate data.
- Detect outliers.
- Scale each of the variables.
- Discretise data that provides more information than we need.
- Encode non-numerical variables so that they can be input into the machine learning system.
- Select the most relevant variables.
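As a rough sketch, several of these steps can be chained on a toy record set. The field names and values here are invented, and a real project would typically use libraries such as pandas and scikit-learn instead:

```python
def preprocess(records):
    """Toy pipeline over a list of dicts: drop records with empty fields
    and exact duplicates, min-max scale 'glucose', one-hot encode 'pet'."""
    # Search for empty fields and delete duplicate data.
    seen, clean = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if None not in record.values() and key not in seen:
            seen.add(key)
            clean.append(dict(record))
    # Min-max scale the glucose field (assumes at least two distinct values).
    glucose = [r["glucose"] for r in clean]
    lo, hi = min(glucose), max(glucose)
    for r in clean:
        r["glucose"] = (r["glucose"] - lo) / (hi - lo)
    # One-hot encode the non-numerical pet field.
    pets = sorted({r["pet"] for r in clean})
    for r in clean:
        pet = r.pop("pet")
        for p in pets:
            r[f"pet_{p}"] = 1 if pet == p else 0
    return clean

rows = [
    {"glucose": 74, "pet": "Dog"},
    {"glucose": 74, "pet": "Dog"},     # duplicate record
    {"glucose": 106, "pet": "Cat"},
    {"glucose": None, "pet": "Fish"},  # missing value
]
print(preprocess(rows))
```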
These data pre-processing techniques are the first step that the Xeridia Artificial Intelligence team takes when we approach a Machine Learning project, extracting the information that is of most value to the successful development of our customers’ projects. If you’d like to know more about how we can help you to improve your processes with AI applications or about Xeridia’s AI success stories, please get in touch using the contact form on our website.
Oscar García-Olalla Olivera is a Data Scientist and R&D Engineer at Xeridia