How to design ML4EO pipelines (Part 1)

Many great blogs give a general overview of how to apply Machine Learning (ML) to satellite images. But none of these prepared me for all the steps you need to climb to design ML for satellite data. So, in this blog, I share some challenges you can expect when working with satellite images as an ML engineer.

What is an ML4EO pipeline?

This blog introduces how to build ML4EO pipelines. ML4EO pipelines are ML projects that use Earth Observation (EO) data. I call them pipelines because, most of the time, you need multiple pre- and post-processing steps: first to get the EO data into an ML-ready format, and then to convert the ML-predicted scores into human-readable output.

Very broadly, the steps in an ML pipeline are:

Obtaining the (raw) data
Pre-processing the data
Labelling the data
Developing an ML model
Deploying an ML model and producing test predictions
Post-processing of model output
Analysis/final results

These steps are different for every task and satellite image dataset. I learned that you can only predict the exact steps in your path if you have some basic knowledge about the data.
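To make the broad steps listed above a bit more tangible, here is a minimal sketch of them as a chain of Python functions. Everything in it is an assumption for illustration: the function names are hypothetical, the "satellite image" is random numbers, the labels come from a simple threshold, and the "model" is a toy cut-off rather than a real ML model. The point is only the shape of the pipeline, not the content of any step.

```python
import numpy as np


def obtain_raw_data(rng):
    # Stand-in for downloading a scene: a random 4-band, 64x64 image
    # stored as unsigned integers (hypothetical toy data).
    return rng.integers(0, 10_000, size=(4, 64, 64), dtype=np.uint16)


def preprocess(raw, scale=10_000.0):
    # Convert to floats in [0, 1). The scale factor is an assumption.
    return raw.astype(np.float32) / scale


def label(image, threshold=0.5):
    # Toy labels: a pixel is "positive" when band 0 is bright.
    return (image[0] > threshold).astype(np.uint8)


def train_model(image, labels):
    # Toy "model": the midpoint between the two class means of band 0.
    band = image[0]
    cut = 0.5 * (band[labels == 1].mean() + band[labels == 0].mean())
    return {"cut": float(cut)}


def predict(model, image):
    # Turn the distance to the learned cut-off into a score in (0, 1).
    return 1.0 / (1.0 + np.exp(-20.0 * (image[0] - model["cut"])))


def postprocess(scores, threshold=0.5):
    # Convert predicted scores into a binary map.
    return scores > threshold


def analyse(mask, pixel_area_m2=100.0):
    # Human-readable result: mapped area in km^2, assuming 10 m pixels.
    return float(mask.sum()) * pixel_area_m2 / 1e6


def run_pipeline(seed=0):
    rng = np.random.default_rng(seed)
    raw = obtain_raw_data(rng)
    image = preprocess(raw)
    labels = label(image)
    model = train_model(image, labels)
    scores = predict(model, image)
    mask = postprocess(scores)
    return mask, analyse(mask)


mask, area_km2 = run_pipeline()
print(f"Mapped area: {area_km2:.4f} km^2")
```

In a real project, each of these functions tends to grow into several steps of its own, and which ones you need depends on the data.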

My path to EO

Earth Observation (EO) drew me in. Applications like deforestation monitoring resonated with me. The different data types intimidated but also intrigued me. I loved ML, but I didn’t want to classify handwritten digits. I wanted to solve real problems.

Also, I loved the pretty pictures. My background is in astronomy, so what can I say? I appreciate beautiful pixels.

I dove straight into satellite imagery in my master’s thesis. Apart from my bachelor’s background in astronomical image processing, I knew nothing about satellite images. As a result, ML4EO pipelines were vague concepts to me. I had no clue what I would do beyond the broad steps described in the previous section.

The only formal education I got was a short workshop on processing raster data. I learned to work with satellite images and EO applications on the job. While sometimes overwhelming, this hands-on approach taught me a lot in three years.

Why knowing challenges is useful

I didn’t know which challenges I could expect when working with satellite images. I developed my pipeline like walking through fog: I could only see the next two or three steps in front of me. At every step, I would discover new problems to solve. Expecting challenges means you can list concrete steps to your goal right after reading about a new ML technique. You’ll know which techniques won’t (directly) work with your data. For example, you’ll know that some data augmentations would break your labels.

The same holds when you change data instead of techniques. With every experience, you learn to look a bit further ahead. With enough experience, the fog goes away. You'll see the horizon and make a reliable plan for your whole pipeline.
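As one possible illustration of an augmentation that breaks labels, consider a task with per-pixel labels (a segmentation mask). This is my assumption for the example, and so is the choice of a horizontal flip. If you flip the image but forget to flip the mask, the labels no longer line up with the pixels they describe. A minimal NumPy sketch with a made-up 4x4 image:

```python
import numpy as np

# Toy single-band image with one bright "object" in the top-left corner.
image = np.zeros((4, 4), dtype=np.float32)
image[0, 0] = 1.0

# Per-pixel label: 1 where the object is, 0 elsewhere.
mask = (image > 0.5).astype(np.uint8)


def flip_image_only(image, mask):
    # Broken augmentation: the image moves, the labels stay put.
    return np.fliplr(image), mask


def flip_both(image, mask):
    # Label-preserving augmentation: same transform for image and mask.
    return np.fliplr(image), np.fliplr(mask)


def labels_match(image, mask):
    # True when the mask still marks exactly the bright pixels.
    return np.array_equal(image > 0.5, mask.astype(bool))


print(labels_match(*flip_image_only(image, mask)))  # False
print(labels_match(*flip_both(image, mask)))        # True
```

Whether a given augmentation is safe depends on how your labels relate to the pixels, which is exactly the kind of thing you learn to check up front.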

You can even avoid some problems before you start. But predicting challenges isn’t just about avoiding problems. It’s also about identifying new research questions that are interesting to solve. Without challenges, you can’t design new algorithms.

I noticed that blogs and guides on ML4EO focus on ready-to-use techniques and datasets. As a result, the pipelines they describe are pretty simple: you barely have to touch your data. But what do you do if you cannot use those? What does designing a new ML4EO pipeline look like? Which skills do you need, and which tools do you need to learn? Which challenges can you expect along the way?

Hands-on experience is the best way to understand ML4EO pipeline design. The pipelines vary too much from dataset to dataset to give a foolproof recipe. Even my blog about ML4EO datasets is pretty high-level: it doesn’t go into the preprocessing steps you will inevitably need to do to make your dataset. I can’t get that experience for you, but I can give you a head start by sharing my experiences from my ML4EO projects.

Miniseries outline

This blog is the first in a five-part miniseries on the challenges in developing ML4EO pipelines. I write it from my perspective as a machine learning engineer. This won’t be a blog series about choosing datasets or training deep learning models. It won’t be a blog where I explain experimental design, either. My goal is to help you prepare for your first ML4EO projects and feel more confident about them. I’ll write about concrete steps in ML4EO pipelines and what I learned about EO and collaborating with EO scientists.

The following blog posts will cover three projects in chronological order. In each blog, I'll describe a part of one project’s pipeline in detail. I'll focus on data processing because that’s what was newest to me as an ML engineer. I will mention a lot of different techniques but won’t explain them in detail. The point is to get a clearer picture of ML4EO pipelines as a whole.

The final blog in the series will be more high-level. It'll summarise the main roadblocks in my ML4EO journey and what I learned about the difference between ML and EO.

Series outline: