Setting Up a Machine Learning Pipeline For FREE

With Strings Attached, Of Course

Recently I needed to set up a machine learning pipeline for my project, Camera to Keyboard, and since it's an open source project I needed a way to set up a pipeline for free. In this article, you'll read about my approach and its constraints.

A machine learning pipeline is a series of interconnected data processing and modeling steps designed to automate, standardize and streamline the process of building, training, evaluating and deploying machine learning models.



There are several limiting factors to consider before choosing this approach, mainly because we'll be using GitHub Actions for the training process. In short:

  • Training the model on CPU has to finish in less than 6 hours

  • There's a limit on how many times you can download the trained model (a few workarounds have been mentioned, though)

Training Constraints

As mentioned earlier, we're going to use GitHub Actions, and each job in a workflow has a time limit of 6 hours. Moreover, your model is going to be trained on a CPU, which is much slower than a CUDA-capable GPU. GitHub has, however, started offering GPU-enabled runners in private beta for Teams and Enterprise accounts (at the time of this writing). Whether they will ever be free for public repositories remains to be seen (it's highly unlikely).

Download Constraints

For storing the trained model, I'm using AWS S3's free tier, which offers:

  • 5GB of storage

  • 100GB of data transfer per month

  • 20,000 GET requests per month

The 5GB of storage is fine; you most probably don't need to keep all older model versions. But the other two factors need to be taken into account.
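To get a feel for what these limits mean in practice, here's a quick back-of-the-envelope calculation. The model size is a made-up figure; substitute your own:

```python
# Rough capacity check against the S3 free tier limits listed above.
model_size_gb = 0.05          # hypothetical ~50 MB object detection model
transfer_limit_gb = 100       # free-tier data transfer per month
get_request_limit = 20_000    # free-tier GET requests per month

# Each app start that downloads the model costs one GET request and
# model_size_gb of transfer, so whichever limit is hit first caps the
# number of downloads per month.
max_downloads = int(min(transfer_limit_gb / model_size_gb, get_request_limit))
print(max_downloads)  # 2000 downloads/month, bounded by the transfer limit
```

With a 50 MB model, the transfer limit is the binding constraint; for much smaller models, the GET request limit takes over.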

Furthermore, your bucket has to be publicly readable. You can allow public reads using the following bucket policy, while still requiring authentication for writes:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "PublicList",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"

The Use Case

The two key factors for my use case are that I'm training an object detection model and I don't have frequent dataset changes (if you do, read on to the end of the article). The dataset is also not large, so I can get away with storing the training data in the repository, and I don't even need Git LFS since each file is fairly small.

The Pipeline

Here's an overview of how the pipeline works:

  1. New training data is committed into the repository

  2. The GitHub Action checks for changes; if there are any, it trains a new model

  3. The trained model is uploaded to S3

  4. Every time my app runs, it will check for a new version of the model and if one exists, it will be downloaded and used

For detecting changes in the dataset, I initially went with checking the current git commit for changes in the dataset directory, which can be done using the following command:

# with the --quiet flag, git exits with code 1 if there are changes
git diff --quiet HEAD~1..HEAD dataset_dir || echo 'changed'

Ultimately, though, I went with a more robust approach: calculating an md5 checksum of the dataset. Yes, md5 is not cryptographically secure, but the only concern here is a collision, and the chances of that are about as high as winning the lottery (i.e. it's not going to happen). But if it does, feel free to use sha512 ¯\_(ツ)_/¯.
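A minimal sketch of the checksum idea, assuming the dataset lives in a single directory (the function name is mine, not the project's):

```python
import hashlib
from pathlib import Path

def dataset_checksum(dataset_dir):
    """Compute a deterministic md5 digest over every file in the dataset.

    Paths are sorted so the digest doesn't depend on filesystem order,
    and each relative path is mixed in so renames also change the hash.
    """
    digest = hashlib.md5()
    root = Path(dataset_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(root)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

The CI job can compare this digest against the one stored with the last trained model and skip training when they match.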

What about rollbacks, I hear you say? That's an excellent question. If, for example, your model's performance degrades, all you need to do to roll back is revert the git commit that added the new data and delete the trained model from S3 (if it's already been uploaded). That last step could even be automated: you can have another workflow that checks for revert commits and, if one involves your dataset, deletes the relevant version from S3.
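The deletion step could look something like this. The bucket name is the same placeholder used in the policy above, and the key layout (`model-v<version>.pt`) is an assumption of mine, not the project's actual scheme:

```python
BUCKET = "YOUR_BUCKET_NAME"  # placeholder, as in the bucket policy above

def model_key(version):
    # Hypothetical key layout: each model is stored as model-v<version>.pt
    return f"model-v{version}.pt"

def delete_model(version):
    """Delete the model object for a reverted dataset version from S3."""
    import boto3  # imported lazily; requires AWS credentials at call time
    s3 = boto3.client("s3")
    s3.delete_object(Bucket=BUCKET, Key=model_key(version))
```

A rollback workflow would call `delete_model` with the version that the reverted commit introduced, e.g. `delete_model(3)` to remove `model-v3.pt`.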

Let's go over the solution in detail now. I won't paste all the code here, as that would make the article too long, and it's already publicly available. I will, however, link to the relevant files so that you can easily refer to them.

The Trainer

First off, I have my trainer class that takes care of the training:

# When instantiating the trainer, you can specify where the trained
# model should be copied to. That allows the trainer to be used both
# in CI and when running it locally, for instance via `python train`

import os
import tempfile

target_dir = os.path.join(tempfile.gettempdir(), 'myproject')
trainer = Trainer(config, target_dir)  # runs the actual training process

# You can also get the current model version (or, to be exact, the
# next version, if it hasn't been trained yet)

The CI

Now, for the CI action, I opted to have an accompanying Python script. That just makes life easier and keeps the workflow simple. You can check out the files here:

The workflow has 5 steps:

  1. source checkout

  2. configure-aws-credentials: Get credentials to make requests to s3 (required for the next step)

  3. Train: calls the train function in the accompanying script. Before training, though, it checks whether the current version has already been trained and uploaded to S3.

  4. configure-aws-credentials: again, yes. In my case, training takes more than an hour, which is the default expiration time of the AWS token. An alternative to fetching the credentials again is to set the Credential Lifetime parameter.

  5. And finally, upload the model to S3 by calling the upload_model function in the accompanying script.
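Put together, the workflow skeleton might look roughly like this. The step names, script paths, role ARN, and region are placeholders, not the project's actual configuration:

```yaml
name: train
on:
  push:
    branches: [main]

jobs:
  train:
    runs-on: ubuntu-latest
    timeout-minutes: 360  # GitHub enforces a 6-hour per-job limit anyway
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Train
        run: python ci_train.py    # hypothetical accompanying script
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Upload
        run: python ci_upload.py   # hypothetical accompanying script
```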

Integrating the Pipeline Into the App

Now it's time to reap the rewards of the pipeline. We can simply list the objects in our S3 bucket, find the latest one based on LastModified, and check whether it has already been downloaded. If not, download it! Here's the implementation of that class:
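The core of that logic can be sketched with two small helpers. The function names are mine; the object dicts mirror the shape of the "Contents" list that boto3's list_objects_v2 returns:

```python
def latest_model(objects):
    """Return the most recently modified object from an S3 listing.

    Each entry in `objects` carries a "Key" and a "LastModified"
    datetime, like the "Contents" list from boto3's list_objects_v2.
    """
    return max(objects, key=lambda obj: obj["LastModified"])

def needs_download(latest, local_key):
    """True when the locally cached model key differs from the latest on S3."""
    return local_key != latest["Key"]
```

In the app, the listing would come from `s3.list_objects_v2(Bucket=...)["Contents"]` and the chosen object would be fetched with `download_file`; both are standard boto3 calls.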

Final Thoughts

There are a lot of improvements that can be made here. To name a few:

  • Training the model from scratch every time is redundant and a waste of time, especially if training takes long. Again, I won't have frequent dataset changes, but if you do, consider saving checkpoints and resuming training with the new data, while keeping an eye out for overfitting.

  • If you have a large dataset that just can't be trained within 6 hours on a CPU, you can alternatively spin up a remote node (say an EC2 instance with GPU) and train and upload your model on that instance.

  • If the free tier of S3 isn't enough for you, consider alternative storage options. For instance:

    • Cloudflare R2 has a more generous free tier and its API is S3 compatible.

    • You might even get away with using Google Drive or Dropbox. I have not explored these options though and don't know for sure if they're feasible or not.

  • And finally, regarding the versioning system, I'm still not sure it's the best idea. It has its merits, but maybe just following semantic versioning and tagging the models with the commit IDs that introduce changes (for rollbacks) is a more solid approach. It all depends on your use case, though.