Productionise your Data Science or Machine Learning code

My current role as a Lead AI Engineer usually revolves around leading teams of data scientists to productionise and scale their models, so that the models can be applied to larger datasets and more markets, and be democratised and used more easily by more people within their company.

Sometimes I come into the team at a point where there already is a Proof of Concept living in a Jupyter Notebook somewhere, and other times I’m part of the team from day 1... Read more

Building a generic web crawler/spider in Python

I’ve been doing web crawling and web scraping for years. It’s funny, but it was actually my interest in web crawling that introduced me to Python and made me give up my precious PHP, which was what I coded in earlier in my career.

Python simply had much better tooling, and the language itself was more suitable for building background processes and applications that were not just websites depending on an HTTP request for execution... Read more

How to make Singleton objects in Python

Singletons are a common pattern in software engineering: you create a class that can only be instantiated once, so only a single instance of the class can ever exist.

Why would this be useful? Well, in one of my recent projects, I built a Context class that would hold metadata about the current execution of a pipeline. It included things like unique IDs, timestamps and more.

It was very... Read more
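As a rough illustration of the pattern described above, here is a minimal sketch of a singleton Context class. The attribute names run_id and started_at are assumptions made for this example, and overriding __new__ is just one of several ways to build a singleton in Python, not necessarily the approach used in the post.

```python
from datetime import datetime, timezone
from uuid import uuid4


class Context:
    """Hypothetical pipeline context: only one instance ever exists."""

    _instance = None

    def __new__(cls):
        # Build the instance on the first call and reuse it afterwards.
        if cls._instance is None:
            instance = super().__new__(cls)
            # Assumed metadata fields, chosen for illustration only.
            instance.run_id = uuid4().hex
            instance.started_at = datetime.now(timezone.utc)
            cls._instance = instance
        return cls._instance


first = Context()
second = Context()
print(first is second)  # True: both names point at the same object
```

Because the metadata is set only when the instance is first created, every part of a pipeline that calls Context() sees the same run_id and started_at.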

How to make Lazy objects in Python

Lazy evaluation is an interesting topic that can be a little difficult to understand at first glance. I often have to help people who misunderstand when their code is executed because they use lazy technologies such as Spark DataFrames or the Django ORM.

Lazy evaluation can be explained quite simply as “Wait to evaluate a value until it’s needed”.

What are Lazy Objects in programming?

Isn’t “Wait to evaluate a value until it’s needed” clear... Read more
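To make the idea of delayed evaluation concrete, here is a minimal sketch of a lazy wrapper. The class name Lazy, the factory parameter and the expensive function are all hypothetical names invented for this example; it is not the implementation from the post, and it is far simpler than what Spark or the Django ORM do internally.

```python
class Lazy:
    """Hypothetical wrapper that delays calling `factory` until `.value` is read."""

    _UNSET = object()  # sentinel meaning "not evaluated yet"

    def __init__(self, factory):
        self._factory = factory
        self._value = self._UNSET

    @property
    def value(self):
        # Evaluate on first access only, then reuse the cached result.
        if self._value is self._UNSET:
            self._value = self._factory()
        return self._value


calls = []


def expensive():
    # Record each call so we can see when evaluation really happens.
    calls.append("ran")
    return 42


lazy = Lazy(expensive)
print(calls)       # []  (nothing has been evaluated yet)
print(lazy.value)  # 42  (evaluated on first access)
print(lazy.value)  # 42  (served from the cached result)
print(calls)       # ['ran']  (the function ran exactly once)
```

Creating the Lazy object does no work at all; the function only runs when the value is first read, which is the same reason a Spark or Django query does not execute on the line where it is defined.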

Publish your documentation to GitHub Pages from Jenkins Pipeline

Proper documentation is a must-have for any project that is expected to be shared or handed over to someone else in the future. Personally, in all of my professional projects I stress the importance of starting the documentation at the beginning of the project and populating it as we go along, instead of treating it as an afterthought and filling it out at the end, just before client handover.

Some people leave their documentation as README files in... Read more

A complete guide to CI/CD Pipelines with CircleCI, Docker and Terraform

Setting up a Continuous Integration and Continuous Delivery pipeline has become something that every project requires from the start these days to enforce high-quality code. I see it in all kinds of projects, whether they are for the web, data science, machine learning and so on.

With a CI/CD pipeline, you can enforce that any code that gets merged into your release branch and deployed to production passes all the tests and checks that you have configured, such as... Read more

How to Write Unit Tests and Mock with Pandas

I spend a lot of my time working closely with data scientists, reviewing their code, giving feedback and guiding them to follow the best software engineering practices. The point is to make sure that our projects have scalable, robust and well-tested code.

I was recently working on a project where I immediately pulled down the code repository that the data scientists were working on to run the tests locally, and to my surprise, I noticed that... Read more

Generate Python Unit Test & Flake8 Reports for JUnit

A core part of your Continuous Integration (CI) pipeline is often generating reports that detail the status of your unit tests, their coverage and the static analysis of the codebase.

Usually, CI/CD Pipeline tools such as Jenkins or CircleCI prefer the reports in a specific format, so that they can easily display the contents within their UI. This format is usually the “JUnit” format, which originates from the Java community.

This causes some issues for us... Read more

Why you should use Docker - A Complete Guide

Getting your Docker image built and running is a great achievement in itself, but the world of Docker does not stop there. There are still plenty of useful things to learn and sink your teeth into when it comes to Docker, especially when it comes to optimizing your image.

So what is meant by “optimizing” in this context? In general, we talk about optimizing the image from two perspectives:

  • Optimizing Build Time
  • Optimizing Image Size

The first... Read more

How to Optimize Docker Images for Smaller Size?

Docker is one of those rare technologies that completely rocks your world once you understand its benefits and uses. It is definitely the number one technology that I have appreciated getting into over the last few years, and these days I use it in every single Software Engineering project that I work on.

At first glance, Docker might seem a bit overwhelming because it seems like there is so much to it. You will quickly discover that it is made... Read more