August 31, 2024
Anyone who can code can make predictions with a machine learning (ML) model. But doing ML well takes more than knowing PyTorch or scikit-learn: you need to know how to evaluate models and design experiments. In this blog post, you’ll learn the basics of experimental design and variance reduction.
An experiment is an artificial setting where you vary one or more independent variables to measure the effect on a dependent variable [1]. The independent variables are your setup: the model, data, and hyperparameters like learning rate. The dependent variable is your target metric (like the loss).
A controlled experiment is a special type where we only change one variable at a time. Controlled experiments cut down on the noise and highlight the factors influencing our results. In practice, the definition of “one variable at a time” is a bit loose. Comparing models with different hyperparameter settings still counts; in fact, optimising the hyperparameters of each model is crucial for a fair comparison. Just keep the data, resources, and so on equal.
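To make this concrete, here is a minimal sketch of a controlled experiment on synthetic data (the dataset, the MLPClassifier model, and the learning-rate values are just placeholders for your own setup): only the learning rate changes between runs, while the data split and random seed stay fixed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Fixed setup: same synthetic data, same split, same seed for every run.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The only independent variable we change is the learning rate.
for lr in [0.001, 0.01, 0.1]:
    model = MLPClassifier(learning_rate_init=lr, max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    print(f"learning rate {lr}: test accuracy = {model.score(X_test, y_test):.3f}")
```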
Experiments are not just for evaluating a finished model; they help us develop it. Published papers list only a small subset of all the experiments that went into building the model. We need to run many experiments to understand the task and build a good model. The main types of experiments are:
Controlled experiments remove confounding factors to measure the effect of a specific variable—for example, the effect of the learning rate. But the output of your experiment won’t be the same every time. Think of flipping a coin: you can flip it the same way 10 times in a row, but it’s unlikely it will land on the same side. You can capture the probability of getting heads or tails in a distribution, but you can’t say for sure what the result of the next toss will be. Such processes are called stochastic or non-deterministic.
Many ML algorithms are stochastic. For example, different weight initialisations cause otherwise identical neural networks to reach different performance. You may get a lucky result once, but the model can do worse the next time you train it from scratch.
These performance fluctuations are also called variance. Variance is the tendency to learn random things irrespective of the real signal [2]. There are two primary sources of variance: the data (which samples end up in your training and test sets) and the model (randomness in initialisation and optimisation).
In ML experiments, variance hurts reproducibility. You don’t want to get different results if you re-run your experiments. But we can learn a lot from variance, too. So good experimental design aims to capture the data and model variance rather than hide it.
There are lots of ways to measure variance in ML results. The common thread through all of them is repeating runs. Instead of a single result per configuration, you get a distribution of results. You want that distribution to be as close as possible to the “true” underlying distribution. Techniques that help with this are called variance reduction in statistics.
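As a minimal sketch (on synthetic data, with MLPClassifier standing in for your own model), repeating the same training run with different random seeds and summarising the scores might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Repeat the same run with different seeds; only the random initialisation changes.
scores = []
for seed in range(10):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Report a distribution of results instead of a single number.
print(f"test accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f} over {len(scores)} runs")
```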
If you think statistics are boring, I have good news for you. You’re likely already applying variance reduction in your experiments. Two very common ML techniques are forms of variance reduction: stratified sampling and cross-validation.
Stratified sampling is a technique to reduce the differences between the train and test sets. In ML, a stratified split means the class balance is the same in train and test. It’s super easy in sklearn:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
```
Stratified sampling can improve your test results and make them more consistent. The standard approach is to stratify on the class label. But you can get creative and stratify on other data attributes (e.g., the date or location).
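For example, here is a sketch of stratifying on a hypothetical location attribute; the dummy data below is just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: `location` is a hypothetical per-sample attribute (e.g. a region code).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
location = rng.choice(["north", "south", "east"], size=1000)

# Stratifying on `location` keeps the proportion of each region the same
# in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=location, test_size=0.2, random_state=0
)
```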
Cross-validation is a technique where you split your training data into k chunks, or folds. You train your model on k-1 of the folds and evaluate it on the remaining one, then repeat this k times so that each fold is used for evaluation exactly once.
Cross-validation measures your model’s sensitivity to small differences in the data. It’s a very data-efficient way to quantify variance in your ML experiment. However, the downside is that you need to train your model many times. As a result, cross-validation is not used a lot with neural networks. But if you can afford it, it’s a great way to squeeze extra juice from smaller datasets.
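Here is a minimal sketch with scikit-learn (logistic regression on synthetic data, standing in for your own model and dataset); the spread of the fold scores is a direct measure of that sensitivity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# 5-fold cross-validation: each fold is used for evaluation exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"fold accuracies: {scores.round(3)}")
print(f"mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```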
Most textbooks and courses explain the important ML algorithms but don’t always cover experimental design. Yet we cannot build good ML models without solid experimental design. Many of its principles come from statistics. Lots of people think statistics is boring, abstract, and hard, but the statistics we need for ML really isn’t that difficult. And statistics is our friend: good experimental design saves us time in the long run.
Though ML changes and grows like crazy, the principles of experimental design haven’t changed much. I’ve found that older papers and editorials on this topic are great resources (and fun time capsules). Here are some of my favourites:
More practical resources are these great introductory blogs:
Finally, getting a good textbook pays off if you’re serious about statistics. I used and loved this one in my bachelor’s:
Please reach out if you have more questions about experimental design!