Posted: January 4, 2025

9 tips for data version control in large projects

Data version control (DVC) is a powerful Python library for managing data processing steps in ML pipelines. It is actively maintained and has extensive documentation. Still, it took me over a year to finally get it working. Realistic pipelines are often more complicated than tutorial examples. In this blog, I share nine tips for setting up DVC in projects with lots of data.

Outline:

  • Why version your data?
  • How does DVC work?
  • How to version data with DVC
  • 9 tips for setting up DVC
  • Further reading

Why version your data?

Have you ever had to rerun experiments because you forgot to rerun a data pre-processing step? I need multiple pre-processing steps: filtering, cropping, normalising, splitting, etc. Sometimes, creating ML datasets is just as complex as designing the model. Most of us already use tools like TensorBoard and Weights & Biases to keep track of model design. So, what about our data?

The problem with versioning data is that we cannot push large files or thousands of images to our GitHub repositories. But nothing is as frustrating as discovering that you ran your experiments with an outdated or incorrect data version. You can sort of version data by updating the filename: appending “normalised” or “cropped.” This mechanism has failed me many times, and I, for one, am tired of discarding results because of data problems!

Versioning your data can solve many problems, especially if you link data versions to code versions. Imagine an automated way to say: this data was made with this normalisation procedure and this filtering step. It removes reliance on hardcoded and, let's face it, arbitrary filenames. As a result, your code will be less bug-prone and more reproducible.

How does DVC work?

As I mentioned, versioning data is harder than versioning code because data files are much larger. Moreover, most of the time, we don't just want to know the version of input, intermediate and processed data, but also how they are linked. Many data processing bugs stem from forgetting to re-run some pipeline parts after changing them. DVC uses a simple trick to assign versions and links to large files. When you track a piece of data, DVC creates a unique string based on the file's contents; a single change to the file can change the string completely. This process is called hashing. The links between data and scripts are defined in YAML files, which live in your repository and are pushed to git.

With these two types of files, a single status check can show you if you:

  • Implemented a new data split but forgot to run it. (script updated, data outdated)
  • Re-calculated the training set means but didn’t normalise the data. (intermediate data updated, processed data outdated)
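On the command line, this check is a single command:

# report which pipeline stages and tracked files are out of date
dvc status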

How to version data with DVC

Versioning data in DVC takes three steps:

  • dvc add: compute hashes for the data and save them in files tracked by git. You can add data to the project in a few different ways (see Adding data).
  • dvc commit: move or copy the data to the cache and link the data back into the working directory.
  • dvc push (optional): upload data to remote storage for backup or collaboration.
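As a minimal sketch, the three steps above look like this for a hypothetical data/raw_data folder:

# compute hashes and write data/raw_data.dvc, which is tracked by git
dvc add data/raw_data
# make sure the data itself is stored in the cache
dvc commit data/raw_data
# optional: upload the cached data to remote storage
dvc push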
[Figure: DVC vs git]

The idea is simple. But if you have a lot of files, you may not be able to follow the DVC tutorial step by step, and if you set things up wrong, DVC can be very slow or waste storage space. DVC has all the functionality you need to prevent this, but it can be tricky to find in the documentation (it took me a lot of trial and error). Below are nine tips you should read before adding DVC to your own project.

9 tips for setting up DVC

My tips supplement the official DVC tutorial: “Get Started: Data Pipelines.” The tips are in chronological order and help you:

  • Set up cache
  • Decide on remote storage
  • Efficiently add data
  • Update data

Configuring the cache

Before you add any data, consider how to set up your cache. When you commit data, DVC moves it to the cache, which can grow large because it holds every committed version of your datasets.

Tip 1: Change cache location

By default, DVC keeps the cache inside the repository (under .dvc/cache), which in my case meant the home directory. You can point it to a different path if that location does not have enough space. There are two ways to configure the cache location:

  1. In the shared project config. Git pushes this configuration to your repository, so anyone working with that repo will commit data to the same cache. Use it when collaborating on the same machine to avoid storing multiple copies of the cache.
  2. In the local config. This private DVC configuration isn't pushed to GitHub. Use it when accessing the repository from machines with different paths to the data partition.

You can update the configuration using the DVC CLI:

# shared config, committed to git
dvc config cache.dir <filepath>
# local config, not committed
dvc config --local cache.dir <filepath>

Tip 2: Store small files

DVC doesn’t keep duplicates of your data. It uses the unique hash to check if the data is already in the cache. However, if you make just one small change in a large file, the hash will change. As a result, DVC will store a copy. So, saving your data in multiple small files is better. DVC will only save duplicates of individual files that have changed instead of the entire dataset.

Having many small files does create some problems with using remote storage, so you’ll need to weigh the pros and cons.
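As a hypothetical example, you could shard one big CSV into fixed-size chunks, so a change only forces DVC to re-cache the shards it touches:

# split a large table into 10,000-line shards; only changed shards get re-cached
mkdir -p data/shards
split -l 10000 data/big_table.csv data/shards/part_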

Remote storage

DVC can download and upload cached data to remote storage. It's a useful single source of truth when collaborating, and a backup, too. The dvc push command uploads data from the cache to the remote location, and dvc pull downloads remote data into the cache. Alternatively, you can use dvc add --to-remote to add data directly to the remote location without caching it locally.
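If you do want a remote, setting one up is a one-time configuration. A sketch with a placeholder WebDAV URL:

# register a default remote (the URL is a placeholder)
dvc remote add -d myremote webdavs://example.com/remote.php/dav/files/me/dvc-storage
# upload cached data to it
dvc push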

Tip 3: You don’t always need remote storage

DVC has many options for adding remotes. However, I don't think using a remote is always necessary or the right choice. The problem is how the data is pushed to the remote: DVC reviews each file one by one and then communicates with the remote storage to check whether it needs to be uploaded. I used WebDAV to synchronise a few thousand images with SURFDrive (part of the Dutch national compute services for universities). It was so slow that the operation timed out!

You can speed up pushing and pulling by reducing the number of files, for example by creating tarballs. However, then you run into the cache problem with large files (see Tip 2). I don't collaborate on my code, so remote storage wasn't worth the hassle for me.

[Figure: My 'experiments' with remote storage]

Adding data

The DVC tutorial shows how to track a single .csv dataset. We need some tricks to add data in realistic projects with many more inputs, intermediate data and outputs.

Tip 4: Edit dvc.yaml to track data

There are three ways you can track data. The best option is to manually edit the dvc.yaml file: versioning will work as it should, and dvc status reports will be more granular.
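As a sketch (the script name and paths are hypothetical), a hand-written stage in dvc.yaml could look like this:

stages:
  normalise:
    cmd: python normalise.py data/raw_data data/normalised_data
    deps:
      - normalise.py
      - data/raw_data
    outs:
      - data/normalised_data

Because the stage lists both its dependencies and its outputs, dvc status can report exactly which part of the pipeline is outdated.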

Tip 5: Don't add all data in one go

It's fine if you still want to use dvc add. However, don't be tempted to add your entire data directory in one go. For example, if you already have an existing project with some data files:

  • data/
    • raw_data/
    • normalised_data/
    • splits.csv

If you run dvc add data, DVC assumes all files in this folder form a single unit that is generated and changed together. So if you change a single file, DVC will cache a copy of all your data. It also undoes the advantage of using dvc repro to automatically check which stages have to be rerun to reproduce your dataset, because DVC won't be able to detect intermediate outputs.
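Instead, add each logical unit separately; a sketch for the example layout above:

# track each dataset as its own unit instead of running dvc add data
dvc add data/raw_data
dvc add data/splits.csv
# normalised_data is better defined as a stage output in dvc.yaml (see Tip 4)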

Tip 6: You can only add data from the working directory

I store code in the home directory and data in a larger partition (like a /data partition). However, DVC only tracks data added from the working directory, so you need to move your data there. But remember, this is only temporary, because dvc commit moves the data to the cache dir. You can move and commit data in chunks if your home directory has very little space.
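A sketch of that chunked workflow, with hypothetical paths (data on /data, the repo in the home directory):

# move one chunk of data into the working directory at a time
mv /data/myproject/raw_batch_01 data/raw_data/
# hash the chunk and store it in the cache (which Tip 1 placed on the large partition)
dvc add data/raw_data
# repeat with the next chunk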

Tip 7: Create first, track later

Another thing to remember when adding data: you can only add data that already exists. So first run your data processing script, then add the outputs to DVC. There used to be a command called dvc run that would run a script and automatically version its outputs, but it no longer exists. Instead, dvc repro runs data processing pipelines, but it only versions data that is already tracked.
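In practice this just means running the script before the add; a sketch with a hypothetical make_splits.py:

# create the output first...
python make_splits.py
# ...then start tracking it
dvc add data/splits.csv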

Updating data

In any standard ML setting, you update data as you test, debug, and fine-tune your processing. I recommend only tracking data once you're reasonably sure it's in a good state, to avoid storing incomplete data in the cache. Your data will still change, though: either you find a bug, or something to improve.

DVC protects tracked data so you can’t just overwrite it. I explain how to make changes to tracked data in two scenarios:

  • Updating data
  • Replacing data

Tip 8: Unprotect data to update it

To update an existing file, run:

dvc unprotect <filepath>
python <yourscript.py>
# only needed when you track the data with dvc add; skip it when using dvc.yaml
dvc add <filepath>
dvc commit
# record the updated DVC files (.dvc, dvc.yaml, dvc.lock) in git with your message
git add .
git commit -m "<your commit message>"

dvc unprotect removes the protection from the data so you can edit it. However, don't forget to commit the changes later!

Tip 9: Remove data to replace it

To replace a file or folder, run:

# you can remove individual files or whole stages
dvc remove <file.dvc or stage name>
# if replacing a directory, first remove it so you don't accidentally keep old files
rm -rf <filepath>
python <yourscript.py>
# only needed when you track data with dvc add; otherwise, manually add the stage to dvc.yaml again
dvc add <filepath>
dvc commit
# record the updated DVC files (.dvc, dvc.yaml, dvc.lock) in git with your message
git add .
git commit -m "<your commit message>"

dvc remove deletes the stage from the dvc.yaml file, or deletes the .dvc file (with the hash) created by running dvc add. Replacing versioned data is a little annoying when you add data by editing the dvc.yaml file, because after removing the data, you need to manually add the stage again. Moreover, dvc remove doesn't work if you use foreach stages. A workaround is to manually remove the stage from dvc.yaml and dvc.lock. Personally, I just stopped using foreach stages.

Use dvc repro to automatically re-run outdated pipeline stages if you want to avoid all this manual editing.
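A sketch of that workflow:

# re-run only the stages whose dependencies changed and update dvc.lock
dvc repro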

Further reading

Experiment tools in research

DVC can do more than data pipeline versioning:

  • Experiment tracking, including comparing and plotting results.
  • Running parameter grids.
  • Model versioning.

Another big question is how to make it work with slurm, the only way for me to access serious compute. Some libraries connect to commercial services like AWS, but what if your group has its own cluster? And can the library deal with the parallelism that comes from launching many slurm jobs at the same time? For example, I haven't yet figured out whether starting parallel dvc repro runs can mess up the cache.

Some people propose custom solutions, like this wrapper around DVC to submit jobs to slurm. However, these have their own challenges. This particular package assumes you launch slurm jobs over SSH; none of the academic clusters I know work like this: you always submit jobs from the head node. Furthermore, installing extra packages means more potential Python package dependency conflicts and more time spent learning to use them.

I'm hopeful that the experiment tools (=MLOps) landscape will improve for researchers. Forums and GitHub issue pages are full of researchers asking "ok, but how does it work with slurm?" More and more researchers are creating their own toolboxes, like NASLib and OpenML. At some point, commercial software will pick this up and start designing for us, too.

