November 10, 2024
9 tips for data version control in large projects
Data Version Control (DVC) is a powerful Python library for managing data processing steps in ML pipelines. It is actively maintained and has extensive documentation. Still, it took me over a year to finally get it working: realistic pipelines are often more complicated than tutorial examples. In this blog post, I share nine tips for setting up DVC in projects with lots of data.
Outline:
- Why version your data?
- How does DVC work?
- How to version data with DVC
- 9 tips for setting up DVC
- Configuring cache
- Remote storage
- Adding data
- Updating data
- Further reading
- Experiment tools in research
Why version your data?
Have you ever had to rerun experiments because you forgot to rerun a data pre-processing step? I need multiple pre-processing steps: filtering, cropping, normalising, splitting, etc. Sometimes, creating ML datasets is just as complex as designing the model. Most of us already use tools like TensorBoard and Weights & Biases to keep track of model design. So, what about our data?
The problem with versioning data is that we cannot push large files or thousands of images to our GitHub repositories. But nothing is as frustrating as discovering that you ran your experiments with an outdated or incorrect data version. You can sort of version data by updating the filename: appending “normalised” or “cropped.” This mechanism has failed me many times, and I, for one, am tired of discarding results because of data problems!
Versioning your data can solve many problems, especially if you link data versions to code versions. Imagine an automated way to say: this data was made with this normalisation procedure and this filtering step. It removes the reliance on hardcoded and, let’s face it, arbitrary filenames. As a result, your code will be less prone to bugs and more reproducible.
How does DVC work?
As I mentioned, versioning data is harder than versioning code because data files are much larger. Moreover, most of the time we don’t just want to know the version of input, intermediate and processed data, but also how they are linked. Many data processing bugs stem from forgetting to re-run some parts of the pipeline after changing them. DVC uses a simple trick to assign versions and links to large files. When you track a piece of data, DVC creates a unique string based on the file’s contents; even a single change to the file can change the string completely. This process is called hashing. The links between data and scripts are defined in YAML files, which live in your repository and are pushed to git.
With these two types of files, a single status check can show you if you:
- Implemented a new data split but forgot to run it. (script updated, data outdated)
- Re-calculated the training set means but didn’t normalise the data. (intermediate data updated, processed data outdated)
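For illustration, the file DVC creates when you track data is just a small text snippet containing the hash. The sketch below uses a placeholder hash and the splits.csv file from the example layout later in this post:
cat data/splits.csv.dvc
# roughly:
# outs:
# - md5: <hash computed from the file contents>
#   path: splits.csv
If the contents of splits.csv change, so does the hash, which is how a status check detects outdated data.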
How to version data with DVC
Versioning data in DVC goes in three steps:
- dvc add: compute hashes for the data and save them in files tracked by git. You can add data to the project a few different ways (see Adding data).
- dvc commit: move or copy the data to the cache and link the data to the working directory.
- dvc push (optional): upload data to remote storage for backup or collaboration.
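A minimal sketch of the full sequence, using the data/raw_data folder from the example layout later in this post (the last step assumes you have configured a remote):
dvc add data/raw_data
dvc commit
git add data/raw_data.dvc data/.gitignore    # the hash file itself is tracked by git
git commit -m "track raw data with DVC"
dvc push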
DVC vs git
9 tips for setting up DVC
The idea is simple. But if you have a lot of files, you may not be able to follow the DVC tutorial step by step. If you don’t do it right, it can be very slow or waste space. DVC has all the functionality you need to prevent this, but it can be tricky to find in the documentation. (It took me a lot of trial and error.) I give you nine tips you should read before adding DVC to your own project.
My tips supplement the official DVC tutorial: “Get Started: Data Pipelines.” The tips are in chronological order and help you:
- Set up cache
- Decide on remote storage
- Efficiently add data
- Update data
Configuring cache
Before you add any data, consider how to set up your cache. When you commit data, DVC moves it to the cache, which can grow large and hold all the committed versions of your datasets.
Tip 1: Change cache location
By default, DVC saves the cache inside the repository (under .dvc/cache), which in my case is in the home directory. You can change the default location to a different path if your home directory does not have enough space. You have two options to configure the cache location:
- Globally. The configuration is stored in the repository and pushed with git. Anyone working with that repo will commit data to the same cache. Use this when collaborating on the same machine to avoid storing multiple copies of the cache.
- Locally. This private DVC configuration isn’t pushed to GitHub. Use it when accessing the repository from machines with different paths to the data partition.
You can update the configuration using the DVC CLI:
# global cache
dvc config cache.dir <filepath>
# local cache
dvc config --local cache.dir <filepath>
Tip 2: Store small files
DVC doesn’t keep duplicates of your data. It uses the unique hash to check if the data is already in the cache. However, if you make just one small change in a large file, the hash will change. As a result, DVC will store a copy. So, saving your data in multiple small files is better. DVC will only save duplicates of individual files that have changed instead of the entire dataset.
Having many small files does create some problems with using remote storage, so you’ll need to weigh the pros and cons.
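As a rough illustration with a made-up image folder:
# one big archive: a single changed image forces DVC to cache a new copy of the whole archive
dvc add data/images.tar
# a folder of individual images: only new or changed files get new cache entries
dvc add data/images/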
Remote storage
DVC can download and upload cached data to remote storage. It’s a useful single source of truth when collaborating and a backup, too. The dvc push command uploads data from the cache to the remote location, and dvc pull downloads remote data into the cache. Alternatively, you can use dvc add --to-remote to add data directly to the remote location without caching it.
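Configuring a remote looks roughly like this; the remote name and WebDAV URL below are placeholders, not my actual setup:
dvc remote add -d storage webdavs://example.com/remote.php/dav/files/<username>/dvc-storage
dvc push    # upload cached data
dvc pull    # download it again on another machine after git clone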
Tip 3: You don’t always need remote storage
DVC has many options for adding remotes. However, I don’t think using a remote is always necessary or the right choice. The problem is how the data is pushed to the remote: DVC reviews each file one by one and communicates with the remote storage to check whether it needs to be uploaded. I used WebDAV to synchronise a few thousand images with SURFdrive (part of the Dutch national IT services for research and education). It was so slow the operation timed out!
You can speed up pushing and pulling by reducing the number of files, for example by creating tarballs. However, then you run into the cache problem with large files (Tip 2). I don’t collaborate on my code, so remote storage wasn’t worth the hassle for me.
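If you do go the tarball route, it could look roughly like this (paths are made up), at the cost of re-caching the whole archive whenever anything inside it changes:
tar -czf data/images.tar.gz data/images/
dvc add data/images.tar.gz
dvc push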
My 'experiments' with remote storage
Adding data
The DVC tutorial shows how to track a single .csv dataset. We need some tricks to add data in realistic projects with many more inputs, intermediate data and outputs.
Tip 4: Edit dvc.yaml to track data
There are three ways you can track data:
- dvc add: add a “unit” of data to the versioning system. This can be an individual file or a whole folder. DVC assumes this is just one brick of data that is always created at the same time.
- dvc stage add: add a stage to the versioning system, linking inputs and outputs to a specific script. This approach gives DVC more context about how you created the data. DVC saves the dependencies in a dvc.yaml file containing all the stages.
- Manually edit dvc.yaml: this is the most flexible option. While dvc stage add is already pretty powerful, you can also manually add parametrised stages to the dvc.yaml file. For example, run the same script with different arguments, or plot results.
The best option is to manually edit the dvc.yaml file. Versioning will work as it should, and dvc status reports will be more granular.
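As a sketch, a hand-written stage could look like this; the script name and arguments are hypothetical, but the paths match the example layout in Tip 5 below:
cat > dvc.yaml <<'EOF'
stages:
  normalise:
    cmd: python normalise.py --input data/raw_data --output data/normalised_data
    deps:
      - normalise.py
      - data/raw_data
    outs:
      - data/normalised_data
EOF
dvc repro    # runs the stage and records hashes for its outputs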
Tip 5: Don't add all data in one go
It’s fine if you still want to use dvc add. However, don’t be tempted to add your entire data directory in one go. For example, say you already have an existing project with some data files:
- data/
- raw_data/
- normalised_data/
- splits.csv
If you run dvc add data/, DVC assumes all files in this folder form a single unit that is generated and changed together. So if you change a single file, DVC will cache a copy of all your data. It also undoes the advantage of using dvc repro to automatically check which stages have to be rerun to reproduce your dataset, because DVC won’t be able to detect intermediate outputs.
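With the layout above, you can instead track the pieces separately (or, better, turn the derived data into stages as in Tip 4):
dvc add data/raw_data
dvc add data/normalised_data
dvc add data/splits.csv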
Tip 6: You can only add data from the working directory
I store code in the home directory and data in a larger partition (like a scratch partition). However, DVC only tracks data added from the working directory, so you need to move your data into the working directory. But remember, this is only temporary, because dvc commit moves the data to the cache dir. You can move and commit data in chunks if your home directory has very little space.
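A rough sketch, where /bulk stands in for wherever your data partition is mounted:
mv /bulk/myproject/raw_part1 data/raw_part1
dvc add data/raw_part1
dvc commit data/raw_part1
# repeat per chunk, so the home directory never has to hold everything at once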
Tip 7: Create first, track later
Another thing to remember when adding data: you can only add data that already exists. So first run your data processing script, then add the outputs to DVC. There used to be a command called dvc run that would execute a script and automatically version its outputs, but that doesn’t exist anymore. Instead, dvc repro runs data processing pipelines, but it only versions data that is already tracked.
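In practice, that means something like this (the script and paths are hypothetical):
python make_splits.py      # writes data/splits.csv
dvc add data/splits.csv    # only now can DVC track it
# for outputs defined as stages in dvc.yaml, use dvc repro instead of dvc add
dvc repro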
Updating data
In any standard ML setting, you update data as you test, debug, and fine-tune your processing. I recommend only tracking data once you’re reasonably sure it’s in a good state, to avoid storing incomplete data in the cache. Your data will still change, though: either you find a bug, or something to improve.
DVC protects tracked data so you can’t just overwrite it. I explain how to make changes to tracked data in two scenarios:
- Updating data
- Replacing data
Tip 8: Unprotect data to update it
To update an existing file, run:
dvc unprotect <filepath>
python <yourscript.py>
# if adding data with dvc add; you don't need dvc add when using dvc.yaml
dvc add <filepath>
dvc commit -m "<your commit message>"
dvc unprotect removes the protection from the data so you can edit it. However, don’t forget to commit the changes later!
Tip 9: Remove data to replace it
To replace a file or folder, run:
# you can remove individual files or whole stages
dvc remove <filepath or stage name>
# if replacing a directory, I recommend removing it first so you don't accidentally keep old files
rm -rf <filepath>
python <yourscript.py>
# if adding data with dvc add; otherwise, you need to manually add the stage to the dvc.yaml file again
dvc add <filepath>
dvc commit -m "<your commit message>"
dvc remove removes the stage from the dvc.yaml file, or removes the .dvc file with the hash created by dvc add. Replacing versioned data is a little annoying when you add data by editing the dvc.yaml file, because after removing the data you need to manually add the stage again. Moreover, dvc remove doesn’t work if you use foreach stages. A workaround is to manually remove the stage from dvc.yaml and dvc.lock. Personally, I just stopped using foreach stages. If you want to avoid all this manual editing, use dvc repro to automatically re-run outdated pipeline stages.
Further reading
- The official DVC tutorial: a good start, but it only shows a very simple case.
- Data Version Control (Software) - Wikipedia: gives a great high-level overview of DVC concepts and functionality.
- Data Version Control With Python and DVC: in-depth tutorial focussed on setting up DVC for collaboration. It explains the difference between DVC and git very well. However, it’s based on an older version of DVC so some commands are outdated.
- What Is Data Version Control (DVC): explains hashing, switching between data versions, and connecting to aws s3, using a simple dataset as an example.
Experiment tools in research
DVC can do more than data pipeline versioning:
- Experiment tracking, including comparing and plotting results.
- Run parameter grids.
- Model versioning.
Another big question is how to make it work with slurm, the only way for me to access serious compute. Some libraries connect to commercial services like AWS, but what if your group has its own cluster? And can the library deal with the parallelism that comes from launching many slurm jobs at the same time? For example, I haven’t yet figured out whether starting parallel dvc repro runs can mess up the cache.
Some people propose custom solutions, like this wrapper around DVC to submit jobs to slurm. However, they come with their own challenges. This particular package assumes you launch slurm jobs over SSH; none of the academic clusters I know work like this: you always submit jobs from the head node. Furthermore, installing extra packages means more potential Python dependency conflicts and more time invested in learning to use the packages.
I’m hopeful that the experiment tools (= MLOps) landscape will improve for researchers. Forums and GitHub issue pages are full of researchers asking “ok, but how does it work with slurm?” More and more researchers are creating their own toolboxes, like NASLib and OpenML. At some point, commercial software will pick this up and start designing for us, too.