Posted: February 28, 2025

Introduction to satellite data for Machine Learning and GeoAI

Satellite data is one of the most-used data types in Geospatial Artificial Intelligence (GeoAI) and Machine Learning for Earth Observation (ML4EO). In a previous blog, I introduced the ML4EO analysis pipeline, which consists of three stages: data pre-processing, modelling, and post-processing or analysis. Working with satellite data for the first time can be overwhelming because almost every step of the pipeline is affected by the properties and challenges of the data we work with. The goal of this blog is to deepen your understanding of satellite data so you can develop ML4EO pipelines more effectively.

I’ll cover:

Photo Credit: Me

The three stages and substeps in ML4EO pipelines.

What is satellite data?

Satellite images are part of a larger category called remote sensing data. Remote sensing data is any data obtained from a distance, such as images from satellites, planes, or drones. It can also include measurements of the sea floor taken by a ship floating on the surface.

Most Earth Observation satellites create images for weather, environmental monitoring, and military applications. Measurement can be passive (simply catching light on a sensor, like a camera does) or active (emitting a signal and measuring how much of it is reflected back). Finally, satellites operate at different wavelengths. Each wavelength can tell us something different about the Earth’s surface because surface features, such as vegetation and water, absorb and reflect light differently across wavelengths.

Image showing what can be observed in different wavelengths

Photo Credit: JAXA

Furthermore, satellites can be launched into different orbits, each with advantages and disadvantages. Polar orbits provide global coverage; geostationary orbits keep the satellite above the same point on the equator; and sun-synchronous orbits ensure the satellite passes over each location at the same local solar time.

From measurement to image

Getting data from a satellite is much more complicated than snapping a picture with your phone and loading it on your computer. After a satellite completes a measurement (like capturing light in a specific wavelength range), the measurements are downlinked or downloaded to ground stations, where the data is processed. Then, the data is uploaded to a digital archive, where you can access it.

Visualisations of steps from satellite measurement to satellite image

Photo Credit: NOAA/NESDIS

Processing levels

Satellite data is processed in multiple steps, which can take several hours, so the data isn’t available right away. Some satellites offer “near real time” products that are processed faster and thus accessible sooner.

Many satellite data platforms offer different processing levels. For example, NASA uses a four-level system to describe the stages of data processing. Because many satellites are international projects, other agencies, like ESA, often adopt the same levels.

The processing levels are:

  • Level 0 (L0): raw, unfiltered data
  • Level 1 (L1): data is filtered and annotated with metadata like the date and time. Sometimes, the data is converted into other units. For instance, Level 1 Sentinel-5P data are spectra measured by the TROPOMI instrument.
  • Level 2 (L2): geophysical variables are derived from the raw data. For instance, Sentinel-5P L2 data are concentrations of gases calculated from the L1 spectra.
  • Level 3 (L3): Level 2 data averaged over time and/or space (less missing data, lower uncertainties)
  • Level 4 (L4): results of analyses from using lower-level data, like your ML outputs!
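When scripting downloads, it can help to keep these levels in a small lookup. A minimal sketch (the level names follow NASA’s convention; the helper function is just an illustration):

```python
# NASA-style processing levels, summarising the list above.
PROCESSING_LEVELS = {
    "L0": "raw, unfiltered instrument data",
    "L1": "filtered data with metadata, possibly converted units (e.g. spectra)",
    "L2": "derived geophysical variables (e.g. gas concentrations)",
    "L3": "L2 data averaged over time and/or space",
    "L4": "results of analyses built on lower-level data",
}

def describe_level(level: str) -> str:
    """Return a short description of a processing level, e.g. 'L2'."""
    try:
        return PROCESSING_LEVELS[level.upper()]
    except KeyError:
        raise ValueError(f"Unknown processing level: {level}") from None

print(describe_level("l2"))
```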

Satellite data platforms

When the data is processed, it’s uploaded to servers and made available to users through data platforms. Some are open and freely accessible, like the Copernicus Browser. In other cases, like PlanetScope data, you need a license, or you have to buy the images.

In the satellite browser, you can filter by:

  • Date
  • Region
  • Variables like cloud cover
  • Processing level
  • Satellite
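Programmatic access works the same way: you filter a scene catalog on these fields. A minimal sketch with a made-up in-memory catalog (real platforms expose similar fields through search APIs):

```python
from datetime import date

# A made-up in-memory catalog; real platforms serve similar records via an API.
catalog = [
    {"satellite": "Sentinel-2", "date": date(2025, 2, 1), "cloud_cover": 0.10, "level": "L2"},
    {"satellite": "Sentinel-2", "date": date(2025, 2, 6), "cloud_cover": 0.75, "level": "L2"},
    {"satellite": "Sentinel-1", "date": date(2025, 2, 3), "cloud_cover": 0.00, "level": "L1"},
]

def search(catalog, satellite=None, level=None, start=None, end=None, max_cloud=1.0):
    """Filter scenes the way a data-platform search form does."""
    results = []
    for scene in catalog:
        if satellite and scene["satellite"] != satellite:
            continue
        if level and scene["level"] != level:
            continue
        if start and scene["date"] < start:
            continue
        if end and scene["date"] > end:
            continue
        if scene["cloud_cover"] > max_cloud:
            continue
        results.append(scene)
    return results

hits = search(catalog, satellite="Sentinel-2", level="L2", max_cloud=0.2)
print(len(hits))  # only one Sentinel-2 scene passes the cloud-cover filter
```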

Satellite data formats

When you download satellite data, you’ll likely notice two things:

  • The image can have more than three channels, unlike the RGB images we’re used to. Often, each channel represents a wavelength band: a record of how much light is reflected in a specific wavelength range. The narrower the ranges, the better you can differentiate materials and plant types. Therefore, many optical satellites have more than three bands; Sentinel-2, for instance, has 13. On the other hand, Sentinel-1, a radar satellite, has just one band.
  • Apart from the image, the files also describe a lot of extra information, such as the date and time the data was recorded, the processing level, spatial coordinates, units, etc. The metadata can also be rasterised, so one of the channels in your data cube can be a cloud mask from a completely different satellite.

You cannot store this data as a JPEG or PNG because those file types can’t encode all the textual metadata or the coordinates. It is more common to use file types like GeoTIFF or NetCDF, which save the metadata with the raster.

These file formats are convenient because you do not need to match up lots of separate images and CSV files. However, the metadata in the files is still relatively brief. If you want to know more about the algorithms used in the data processing, which external data was used, and so on, you need to go to the satellite’s website and find the documentation. For instance, here are the documents for Sentinel-5P.
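The extra bands are what make band arithmetic possible. A classic example is the Normalised Difference Vegetation Index (NDVI), which combines the red and near-infrared (NIR) bands as (NIR − Red) / (NIR + Red) to highlight vegetation; the reflectance values below are made up:

```python
# Made-up per-pixel reflectances for two bands of a tiny 2x2 patch.
red = [[0.10, 0.40], [0.08, 0.35]]
nir = [[0.50, 0.45], [0.55, 0.30]]

def ndvi(red_band, nir_band):
    """NDVI = (NIR - Red) / (NIR + Red), computed pixel by pixel."""
    out = []
    for red_row, nir_row in zip(red_band, nir_band):
        out.append([(n - r) / (n + r) for r, n in zip(red_row, nir_row)])
    return out

result = ndvi(red, nir)
# Vegetation reflects strongly in NIR, so vegetated pixels approach 1,
# while bare or built-up pixels sit near or below 0.
print(result[0][0])
```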

More information about file formats

Satellite data formats in ML datasets

Satellite data you download from portals is completely unlabelled and uncurated, and creating a machine learning dataset requires a lot of effort. So, most of the time, you’ll be using datasets created by other people. You can find many datasets on Kaggle and in survey papers, and there are more and more benchmark suites that gather multiple EO benchmarks into a single library with a single interface.

No standardised format exists for GeoAI datasets, even when they’re ML-ready. When you download data from a portal, you often get a very large image tile, usually way too large for machine learning. Therefore, you’ll often see people use image patches or individual pixels instead. Patches are simply crops of a tile. ML datasets mainly offer full tiles or patches, sometimes already converted to JPG for training. However, I save data as NetCDFs because all the extra metadata is useful for plotting results (you have the coordinates and timestamp of each image) or even for expanding the dataset with new images.
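Cutting a tile into patches is simple in principle; here is a minimal pure-Python sketch (real pipelines would use array libraries and carry the georeferencing along):

```python
def extract_patches(tile, patch_size):
    """Cut a 2-D tile (a list of rows) into non-overlapping square patches.
    Edge pixels that don't fill a whole patch are dropped, a common choice."""
    height, width = len(tile), len(tile[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size] for row in tile[top:top + patch_size]]
            patches.append(patch)
    return patches

# A made-up 4x6 "tile" of pixel values.
tile = [[10 * r + c for c in range(6)] for r in range(4)]
patches = extract_patches(tile, 2)
print(len(patches))  # 2 rows of patches x 3 columns = 6 patches
```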

How to describe differences between satellite images: resolution

An intuitive way to describe differences between satellite images, apart from the physical quantity they measure, is through spatial, spectral, temporal and radiometric resolution.

  • Spatial resolution: determines how much detail you can see. For instance, Sentinel-2 bands have a resolution of 10 m per pixel, so counting cars is impossible. However, drone and aerial data can get down to a few centimetres per pixel, making it possible to do very detailed work. The resolution can differ per band. Lower spatial resolution means the image covers a wider area, and the image shows a lot of small features, e.g. houses and fields all look very small. In higher-resolution images, these features naturally take up more pixels, so features like cars become much larger. Still, features in satellite images tend to be significantly smaller than in natural images and often take up a few per cent of the image, if even that.
  • Temporal resolution: the time between two images of the same spot. Sentinel-5P is in a sun-synchronous orbit and images the whole Earth every day, while the pair of Sentinel-2 satellites has a revisit time of 5 days.
  • Spectral resolution: how broad the wavelength ranges are of the bands. Sentinel-2 has 13 bands, a much higher spectral resolution than standard RGB images. Still, there are also satellites, so-called hyperspectral imagers, with hundreds of bands that make it possible to distinguish between many types of materials.
  • Radiometric resolution: this is like the sensitivity of the satellite: how small a difference in signal can it still detect? This is usually expressed in bits. RGB images are 8-bit, so pixel values fall in the range [0, 255], but many satellites are much more sensitive and produce images of up to 16 bits. The higher the radiometric resolution, the more possible values, so satellite images aren’t necessarily saved in the 8-bit format. Instead, many pixel values are saved in other units, like digital numbers, reflectances or radiances.
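The value ranges work out as follows: an n-bit sensor distinguishes 2ⁿ values, and products typically store a digital number (DN) plus a scale factor in the metadata to recover the physical quantity (the scale factor below is made up for illustration):

```python
# An n-bit sensor can distinguish 2**n values.
assert 2 ** 8 == 256        # RGB images: values in [0, 255]
assert 2 ** 16 == 65_536    # a 16-bit sensor: values in [0, 65535]

# Products often store a digital number plus a scale factor in the
# metadata; the factor below is made up for illustration.
SCALE_FACTOR = 1 / 10_000

def dn_to_reflectance(dn):
    """Convert a stored digital number to a physical reflectance."""
    return dn * SCALE_FACTOR

print(dn_to_reflectance(4523))  # 0.4523
```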

Missing data

Satellite data products are continuous services because the satellites don’t stop orbiting the Earth. But sometimes, problems occur when creating an image, leading to missing data. There are two types of missing data in satellite data:

  • Holes in data availability: whole images missing
  • Missing pixels

Satellites collect data non-stop, but there can be holes in data availability on the platforms: whole images are missing. There can be a few days or weeks without data when there is a problem with the satellite or with another data source used in the data processing, which isn’t restricted to the satellite’s own measurements. For example, the cloud mask from a different satellite’s data product (like VIIRS) may be used to remove pixels covered by clouds. If this supporting data is missing, it can be necessary to exclude an image altogether.

In other cases, individual image pixels can be missing because of factors like cloud cover, but low-quality pixels can also be removed as part of data processing, for example when there is a large uncertainty in the processing output. Sometimes, it is simply not possible to get reliable data: there are no methane concentration measurements over water in the Sentinel-5P product (because of how water reflects light in the relevant wavelengths), and high mountains can also cause missing data because the large height differences lead to uncertainties in the measurements. These missing pixels aren’t distributed uniformly at random and therefore cause sampling bias.
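In practice, missing pixels usually arrive as a fill value or NaN alongside a quality mask, and any statistics you compute must skip them. A sketch with made-up values:

```python
import math

# A made-up 3x3 patch where NaN marks pixels removed by the cloud mask.
NAN = float("nan")
patch = [
    [0.31, NAN, 0.28],
    [NAN,  NAN, 0.30],
    [0.33, 0.29, NAN],
]

def valid_mean(patch):
    """Mean over valid pixels only; naively averaging NaNs would return NaN."""
    values = [v for row in patch for v in row if not math.isnan(v)]
    if not values:
        return None  # a fully masked patch may need to be dropped entirely
    return sum(values) / len(values)

print(round(valid_mean(patch), 3))  # mean of the 5 valid pixels
```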

What's next?

In the next blog, I’ll cover common satellite data preparation steps for ML and a list of tools to make your life easier. In the meantime, you can check out this series of blogs on designing ML4EO pipelines to read about using satellite data in practice.
