
August 31, 2024
Satellite data is one of the most-used data types in Geospatial Artificial Intelligence (GeoAI) and Machine Learning for Earth Observation (ML4EO). In a previous blog, I introduced the ML4EO analysis pipeline, which consists of three stages: data pre-processing, modelling, and post-processing or analysis. Working with satellite data for the first time can be overwhelming because almost every step of the pipeline is affected by the properties and challenges of the data we work with. The goal of this blog is to deepen your understanding of satellite data so you can develop ML4EO pipelines more effectively.
I’ll cover:
Photo Credit: Me
The three stages and substeps in ML4EO pipelines.
Satellite images are part of a larger category called remote sensing data. Remote sensing data is any data obtained from a distance, such as images from satellites, planes, or drones. It can also include measurements of the sea floor taken by a ship floating on the surface.
Most Earth Observation satellites create images for weather, environmental monitoring, and military applications. Measurement can be passive (simply catching light on a sensor, like cameras do) or active (emitting light and measuring how much of it is reflected). Finally, satellites operate in different wavelengths. Each wavelength can tell us different things about the Earth’s surface because different surface features, such as vegetation and water, absorb and reflect light in different wavelengths.
Photo Credit: JAXA
Furthermore, satellites can be launched into different orbits, each with advantages and disadvantages. Polar orbits provide global coverage, geostationary orbits keep the satellite above the same point on the Earth’s surface, and sun-synchronous orbits ensure the satellite always passes over a given location at the same local time.
Getting data from a satellite is much more complicated than snapping a picture with your phone and loading it on your computer. After a satellite completes a measurement (like capturing light in a specific wavelength range), the measurements are downlinked or downloaded to ground stations, where the data is processed. Then, the data is uploaded to a digital archive, where you can access it.
Photo Credit: NOAA/NESDIS
Satellite data is processed in multiple steps. This takes multiple hours, so the data isn’t available right away. Some satellites offer “*near real time*” products that are processed much faster and are therefore accessible sooner.
Many satellite data platforms offer different processing levels. For example, NASA uses a four-level system to describe different stages of data processing. Because many satellites are international projects, other agencies, like ESA, often adopt these same levels.
The processing levels are:
When the data is processed, it’s uploaded to servers and made available to users through data platforms. Some are open and freely accessible, like the Copernicus Browser. In other cases, like PlanetScope data, you need a license, or you have to buy the images.
In the satellite browser, you can choose:
When you download satellite data, you’ll likely notice two things:
You cannot store this data as a JPEG or PNG because those file types don’t allow you to encode all the textual metadata or the coordinates. Instead, it is more common to use file types like GeoTIFF or NetCDF, which save the metadata with the raster.
This file format is convenient because you do not need to match up lots of separate images and CSV files. However, the metadata in the files is still relatively brief. If you want to know more about the algorithms used in the data processing, which external data was used, etc., you need to go to the website of the satellite mission and find the documentation. For instance, here are the documents for Sentinel-5P.
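To make concrete what “saving the coordinates with the raster” means, here is a minimal sketch of the kind of georeferencing these formats carry. The `GeoTransform` class, its field names, and the example numbers are all hypothetical; it assumes a simple north-up grid in degrees with no rotation, which is a simplification of what real files can store.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GeoTransform:
    """Hypothetical, simplified georeferencing for a north-up raster."""

    origin_lon: float    # longitude of the raster's upper-left corner
    origin_lat: float    # latitude of the raster's upper-left corner
    pixel_width: float   # degrees of longitude covered by one pixel
    pixel_height: float  # degrees of latitude covered by one pixel

    def pixel_center(self, row: int, col: int) -> tuple[float, float]:
        """Return (lon, lat) of the centre of the pixel at (row, col)."""
        lon = self.origin_lon + (col + 0.5) * self.pixel_width
        # Row 0 is the top of the image, so latitude decreases with the row index.
        lat = self.origin_lat - (row + 0.5) * self.pixel_height
        return lon, lat


# Made-up example values: a grid starting at 4.0°E, 53.0°N with 0.01° pixels.
transform = GeoTransform(origin_lon=4.0, origin_lat=53.0,
                         pixel_width=0.01, pixel_height=0.01)
lon, lat = transform.pixel_center(row=10, col=20)
print(f"Pixel (10, 20) is centred at lon={lon:.3f}, lat={lat:.3f}")
```

A JPEG or PNG of the same raster would keep the pixel values but lose these four numbers, and with them the ability to place any pixel on a map.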
Satellite data you download from portals is completely unlabelled and un-curated, and turning it into a machine learning dataset requires a lot of effort. So, most of the time, you’ll be using datasets created by other people. You can find a lot of datasets on Kaggle and in survey papers, and a growing number of benchmark libraries gather multiple EO datasets behind a single interface.
No standardised format exists for GeoAI datasets, even when they’re ML-ready. When you download data from a portal, you often get a very large image tile that is usually way too large for machine learning. Therefore, you’ll often see that people use image patches or pixels instead. Patches are just crops of a tile, and pixels speak for themselves. ML datasets mainly offer full tiles or patches. They are sometimes already converted to JPG for training. However, I save data as NetCDFs because all the extra metadata is useful for plotting results (you have the coordinates and timestamp of each image) or even for expanding the dataset with new images.
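As a rough illustration of how a tile is cropped into patches, here is a minimal sketch in plain Python. The function name `extract_patches` and its parameters are my own for this example, and the tile is a toy list of rows; in practice you would likely do the same thing with array slicing in a library like NumPy or xarray.

```python
def extract_patches(tile, patch_size, stride=None):
    """Yield (top, left, patch) crops from a 2D tile given as a list of rows.

    Patches that would extend past the right or bottom edge are dropped.
    With stride == patch_size (the default) the patches do not overlap.
    """
    stride = stride or patch_size
    n_rows, n_cols = len(tile), len(tile[0])
    for top in range(0, n_rows - patch_size + 1, stride):
        for left in range(0, n_cols - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in tile[top:top + patch_size]]
            yield top, left, patch


# Toy 4 x 6 "tile" where each value encodes its own position (row * 10 + col).
tile = [[row * 10 + col for col in range(6)] for row in range(4)]

for top, left, patch in extract_patches(tile, patch_size=2):
    print(f"patch at row {top}, col {left}: {patch}")
```

Keeping the `(top, left)` offset of each patch is a small design choice that pays off later: together with the tile’s coordinates, it lets you place every patch, and every prediction made on it, back on the map.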
An intuitive way to describe differences between satellite images, apart from the physical quantity they measure, is through spatial, spectral, temporal and radiometric resolution.
[0,255]. The higher the radiometric resolution, the more distinct values a pixel can take, so satellite images aren’t necessarily saved in this format. Instead, many pixel values are saved in other units, like the digital number, reflectances or radiances.
Satellite data products are continuous services because the satellites don’t stop orbiting the Earth. But sometimes, problems occur when creating an image, leading to missing data. There are two types of missing data in satellite data:
Satellites collect data non-stop, but there can be holes in data availability on the platforms where whole images are missing. There can be a few days or weeks without data when there is a problem with the satellite or with another data source used in the data processing. The data processing isn’t restricted to the measurements of the satellite itself. For example, a product might use the cloud mask from the data product of a different satellite (like VIIRS) to remove pixels covered by clouds. It can be necessary to exclude an image altogether if this supporting data is missing.
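A quick way to spot missing images is to look for jumps in the list of acquisition dates. The sketch below is one possible approach, not a standard tool: `find_gaps` is a hypothetical helper, the dates are made up, and the expected interval of one day is an assumption you would replace with the revisit time of your satellite.

```python
from datetime import date, timedelta


def find_gaps(acquisition_dates, expected_interval_days=1):
    """Return (last_date_before_gap, first_date_after_gap) for every gap.

    A gap is any pair of consecutive acquisitions that lie further apart
    than the expected interval between images.
    """
    dates = sorted(set(acquisition_dates))
    max_step = timedelta(days=expected_interval_days)
    return [(previous, current)
            for previous, current in zip(dates, dates[1:])
            if current - previous > max_step]


# Made-up acquisition dates with three days of images missing.
available = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),
             date(2024, 1, 7), date(2024, 1, 8)]

for before, after in find_gaps(available):
    print(f"No images between {before} and {after}")
```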
In other cases, individual image pixels can be missing because of factors like cloud cover, but low-quality pixels can also be removed as part of data processing. For example, a pixel can be removed when there is a large uncertainty in the processing output. Sometimes, it is not possible to get reliable data at all. For example, there are no methane concentration measurements over water in the Sentinel-5P product (because of how the water reflects light in the relevant wavelengths), and high mountains can be another cause of missing data because the large height differences can lead to uncertainties in the measurements. These missing pixels aren’t distributed uniformly at random and therefore cause sampling bias.
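In the files themselves, missing pixels typically show up as a special fill value or as NaN, and statistics go wrong if you forget to mask them. Here is a minimal sketch of masking them in plain Python. The fill value of `-9999.0` and the helper names are assumptions for this example; the actual fill value is defined in the metadata or documentation of each product.

```python
import math

FILL_VALUE = -9999.0  # assumed placeholder; check your product's metadata


def valid_values(patch, fill_value=FILL_VALUE):
    """Flatten a 2D patch and keep only pixels that are neither fill nor NaN."""
    return [value for row in patch for value in row
            if value != fill_value and not math.isnan(value)]


def valid_fraction(patch, fill_value=FILL_VALUE):
    """Fraction of pixels in the patch that contain usable data."""
    n_pixels = sum(len(row) for row in patch)
    return len(valid_values(patch, fill_value)) / n_pixels


def masked_mean(patch, fill_value=FILL_VALUE):
    """Mean over valid pixels only, or None if the patch has no valid pixels."""
    values = valid_values(patch, fill_value)
    return sum(values) / len(values) if values else None


# Toy patch: the right-hand side is mostly missing, as it might be over water.
patch = [[1.0, 2.0, FILL_VALUE, FILL_VALUE],
         [3.0, math.nan, FILL_VALUE, 6.0]]

print(f"valid fraction: {valid_fraction(patch):.2f}")
print(f"masked mean:    {masked_mean(patch):.2f}")
```

A threshold on the valid fraction is a simple way to drop patches that are mostly empty, but keep in mind that doing so removes the same kinds of places (water, mountains, cloudy regions) again and again, which is exactly the sampling bias described above.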
In the next blog, I’ll cover common satellite data preparation steps for ML and a list of tools to make your life easier. In the meantime, you can check out this series of blogs on designing ML4EO pipelines to read about using satellite data in practice.