Machine Learning with AWS - Part 1: Data Preparation

Welcome back! Let's dig into the first step of successfully using machine learning in your business: data preparation.

Kirk Ryan · 1 minute read

Data preparation is technically one of the easiest elements of a machine learning project to complete, but one of the hardest to master.

It usually consists of:

  1. Data cleaning - is all of your data valid? Are there erroneous values that will confuse your model? Are you sure that the data collected is accurate? How will you prevent malformed data from entering your training set?
  2. Data formatting - is all of your data in a format that your ML model will understand? Typical examples are CSV or Parquet. Do the data columns match across the whole dataset? These are all key considerations.
  3. Data labelling - are all of your datasets labelled correctly, and is your labelling strategy right for your current and, if possible, future use cases?
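
As a minimal sketch of the cleaning and formatting checks in the list above, assuming plain Python and a made-up telemetry schema (the column names `session_id`, `duration_seconds` and `label`, and the allowed label values, are hypothetical), something like this could filter out malformed rows before they reach training:

```python
import csv
import io
import math

# Hypothetical schema: column name -> parser that raises on bad input.
EXPECTED_COLUMNS = {
    "session_id": str,
    "duration_seconds": float,
    "label": str,
}
ALLOWED_LABELS = {"converted", "abandoned"}  # hypothetical label set


def clean_rows(csv_text):
    """Return (valid_rows, rejected_rows) for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Formatting check: the header must match the expected columns exactly.
    if reader.fieldnames != list(EXPECTED_COLUMNS):
        raise ValueError(f"unexpected columns: {reader.fieldnames}")

    valid, rejected = [], []
    for row in reader:
        try:
            parsed = {name: parse(row[name]) for name, parse in EXPECTED_COLUMNS.items()}
        except (TypeError, ValueError):
            # Missing or non-numeric values end up here.
            rejected.append(row)
            continue
        duration = parsed["duration_seconds"]
        # Cleaning check: durations must be finite and non-negative.
        if not math.isfinite(duration) or duration < 0:
            rejected.append(row)
            continue
        # Labelling check: only known labels are allowed.
        if parsed["label"] not in ALLOWED_LABELS:
            rejected.append(row)
            continue
        valid.append(parsed)
    return valid, rejected
```

Keeping the rejected rows, rather than silently dropping them, makes it easier to see how much data is being lost and why.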

Typically the above steps will consume between 60% and 80% of the project lifecycle, so be sure to spend your time wisely. As the age-old adage goes:


Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

Abraham Lincoln


AWS offers a great service called AWS Glue Studio, which makes it relatively easy to automate the ETL (Extract, Transform and Load) process for your data. My particular use case is relatively straightforward: at the end of a user's journey, key telemetry is sent back to our core datastore, where a Glue ETL job runs as a scheduled batch to combine these smaller CSV files into one large training CSV that SageMaker will use to train our predictive models. (It's a shame that, at the time of writing, AWS SageMaker Autopilot doesn't accept Parquet files, as these are smaller and faster for training, so CSV is our go-to.)
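
To make the "combine many small CSV files into one training CSV" step concrete, here is a rough local sketch in plain Python. It is not the Glue job itself and does not use any Glue APIs; the function name `combine_csv_files` and the directory layout are assumptions for illustration only. It also checks that every file has the same header, which ties back to the column-matching point above:

```python
import csv
from pathlib import Path


def combine_csv_files(input_dir, output_path):
    """Concatenate every *.csv in input_dir into one CSV with a single header.

    Returns the number of data rows written. Raises ValueError if any file's
    header differs from the header of the first file.
    """
    output_path = Path(output_path)
    header = None
    rows_written = 0
    with output_path.open("w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in sorted(Path(input_dir).glob("*.csv")):
            if path.resolve() == output_path.resolve():
                continue  # never read the output file back in
            with path.open(newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # The first non-empty file defines the expected header.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path.name}: header {file_header} != {header}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `combine_csv_files("telemetry/", "training.csv")` (both paths hypothetical) would then produce a single file with one header row followed by the rows of every input file, in file-name order.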

Glue Studio Job

The Glue Studio editor does a good job of getting you started, especially if you have little to no experience of handling data or ETL processes, but there are some quirks to be aware of if you want to use AWS SageMaker Autopilot. I'll cover them in another post.

aws
etl
aws-glue