I have an Azure blob container with data which I have not uploaded myself. The data is not locally on my computer. Is it possible to use DVC to download the data to my computer when I haven't uploaded the data with DVC? Is it possible with `dvc import-url`? I have tried using `dvc pull`, but I can only get it to work if I already have the data locally on the computer and have used `dvc add` and `dvc push`. And if I do it that way, then the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format? If it is not possible, is there another way to download data automatically from Azure?
2 Answers
Please bear with me, since you have a lot of questions. The answer needs a bit of structure and background to be useful. Or skip to the very end to find some new ways of handling "Is it possible to upload them in a human-readable format?" :). Anyway, please let me know if that solves your problem, and in general it would be great to have a better description of what you are trying to accomplish at the end (a high-level description).
You are right that by default DVC structures its remote in a content-addressable way (which makes it not human-readable). There are pros and cons to this: it's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch the storage directly and remove something, directory names in the project stay connected to the actual project and keep their meaning, etc.
Some materials on this: Versioning Data and Models, my answer on how DVC structures its data, and the upcoming Data Management User Guide section (still WIP).
That said, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images). To name a few concerns that I see a lot as a pattern:
- Data has been created (and is being updated) by someone else. There is some ETL, a third-party tool, etc. We need to keep that format.
- A third-party tool expects the data in a "human-readable" layout and doesn't integrate with DVC, so it can't access the data indirectly via Git (one example: Label Studio needs direct links to S3).
- It's not practical to move all of the data into DVC; it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its own original layout:
- `dvc import-url` - it will download objects, cache them, and by default push them (`dvc push`) to the remote again to guarantee reproducibility (this can be changed). This command creates a special `.dvc` file that is used to detect changes in the cloud, i.e. to see whether DVC needs to download something again. It should cover the case for "to download data automatically from azure".
- `dvc get-url` - this is more or less `wget`, `rclone`, or `aws s3 cp`, etc., with multi-cloud support. It just downloads objects.
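To illustrate, here is a minimal sketch of what this could look like against an Azure container (the `azure://mycontainer/dataset` URL is a placeholder for your own container/path, and it assumes DVC already has credentials to reach it):

$ dvc import-url azure://mycontainer/dataset dataset   # creates dataset.dvc and caches the data
$ dvc update dataset.dvc                               # later: re-download only if the cloud copy changed
$ dvc get-url azure://mycontainer/dataset dataset      # one-off download, nothing is tracked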
A slightly more advanced option (if you use DVC pipelines):
- Similar to `dvc import-url`, but for DVC pipelines - external dependencies
Then there is the third (new) option. It's in beta phase; it's called "cloud versioning" and essentially it tries to keep the storage human-readable while still benefiting from using `.dvc` files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WIP as I write this; if the PR is merged, it means you can find it in the docs).
The document summarizes the approach well:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.
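For reference, turning this on for an existing Azure remote should be a single remote config change (a sketch; `myremote` is a placeholder name, and the exact option may still evolve while the feature is in beta):

$ dvc remote modify myremote version_aware true
$ dvc push   # objects now keep their original names/paths in the container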

I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
- We're using a DVC pipeline, specified in an existing `dvc.yaml` file. The first stage in the current pipeline is called `prepare`.
- Our data is stored on some Azure blob storage container, in a folder named `dataset/`. This folder follows a structure of sub-folders that we'd like to keep intact.
- The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with name `myazure` (more info about DVC 'data remotes' here).
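For completeness, a remote like `myazure` could have been configured roughly as follows (a sketch; the container path and account name are placeholders, and authentication could equally go through a connection string or SAS token):

$ dvc remote add myazure azure://mycontainer/path
$ dvc remote modify myazure account_name 'myaccount'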
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local `dataset/` folder with the `dataset/` folder on the remote container.
This can be achieved with a command-line tool called `azcopy`, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add `azcopy` to your account or system path, so that you can call this application from any directory on your system.
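As a quick sanity check before wiring `azcopy` into the pipeline, you can verify it is on the path and can reach the container (the URL and SAS token below are placeholders, following the same bracket convention as the commands further down):

$ azcopy --version
$ azcopy list "https://[account].blob.core.windows.net/[container]/dataset?[sas token]"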
The high-level idea is:
- Add an initial `update_dataset` stage to the DVC pipeline that checks if changes have been made in the remote `dataset/` directory (i.e., file additions, modifications or removals). If changes are detected, the `update_dataset` stage shall use the `azcopy sync [src] [dst]` command to apply the changes on the Azure blob storage container (the `[src]`) to the local `dataset/` folder (the `[dst]`).
- Add a dependency between `update_dataset` and the subsequent DVC pipeline stage `prepare`, using a 'dummy' file. This file should be added to (a) the outputs of the `update_dataset` stage; and (b) the dependencies of the `prepare` stage.
Implementation
This procedure has been tested on Windows 10.
- Add a simple `update_dataset` stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file `.dataset_updated` as an output of the stage.
- Edit the `dvc.yaml` file directly to modify the command of the `update_dataset` stage. After the modifications, the command shall (a) create the `.dataset_updated` file after the `azcopy` command - `touch .dataset_updated` - and (b) pass the current date and time to the `.dataset_updated` file to guarantee uniqueness between different update events - `echo %date%-%time% > .dataset_updated`.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
    - remote://myazure/dataset/
    outs:
    - .dataset_updated
...
I recommend editing the `dvc.yaml` file directly to modify the command, as I wasn't able to come up with a complete `dvc stage add` command that took care of everything in one go. This is due to the use of multiple commands chained by `&&`, special characters in the Azure connection string, and the `echo` expression that needs to be evaluated dynamically.
- To make the `prepare` stage depend on the `.dataset_updated` file, edit the `dvc.yaml` file directly to add the new dependency, e.g.:
stages:
  prepare:
    cmd: <some command>
    deps:
    - .dataset_updated # add new dependency here
    - ... # all other dependencies
...
- Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up to the `prepare` stage:
$ dvc repro prepare
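For instance, a hypothetical test of the "file added remotely" scenario could look like this (the file name, URL and SAS token are placeholders):

$ azcopy copy "new_image.png" "https://[account].blob.core.windows.net/[container]/dataset/new_image.png?[sas token]"
$ dvc status          # the update_dataset stage should report changed dependencies
$ dvc repro prepare   # re-runs update_dataset (azcopy sync), then prepare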
Notes
- The solution presented above is very similar to the example given in DVC's external dependencies documentation. Instead of the `az copy` command, it uses `azcopy sync`. The advantage of `azcopy sync` is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
- This example relies on a full connection string with a SAS token, but you can probably do without it if you configure `azcopy` with your credentials or fetch the appropriate values from environment variables.
- When defining the DVC pipeline stage, I've intentionally left out an output dependency with the local `dataset/` folder - i.e. the `-o dataset` part - as it was causing the `azcopy` command to fail. I think this is because DVC automatically clears the folders specified as output dependencies when you reproduce a stage.
- When defining the `azcopy` command, I've included the `--delete-destination="true"` option. This allows synchronization of deleted files, i.e. files are deleted in your local `dataset/` folder if they are deleted on the Azure container.
