
We have started using DVC with Git to version our machine learning projects. For DVC remote storage we use Google Cloud Storage.
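For reference, our setup looks roughly like this (bucket and path names are placeholders):

```bash
# Minimal sketch of our setup; bucket and folder names are placeholders
git init
dvc init
git commit -m "Initialize DVC"

# Use a Google Cloud Storage bucket as the default DVC remote
# (requires the GCS extra: pip install "dvc[gs]")
dvc remote add -d gcs_remote gs://our-bucket/dvc-storage
git add .dvc/config
git commit -m "Configure GCS remote for DVC"
```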

Our dataset is an OCR dataset with more than 100,000 small images; the total size is about 200 MB. Using DVC to track this dataset, we ran into the following problems:

  1. Adding the dataset for tracking takes a very long time.
  2. Upload is very slow.
  3. Download is very slow.
  4. Updating, deleting, or adding just a single image causes DVC to recompute a lot of things: hashes, etc.
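For clarity, these problems correspond roughly to the following commands (the directory name is a placeholder; the commands themselves are standard DVC usage):

```bash
# 1. Track the image directory with DVC (this hashes every file)
dvc add ocr_dataset/
git add ocr_dataset.dvc .gitignore
git commit -m "Track OCR dataset with DVC"

# 2./3. Upload the cache to the GCS remote, and download it on another machine
dvc push
dvc pull

# 4. After changing a single image, re-adding triggers a lot of recomputation
dvc add ocr_dataset/
```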

On the other hand, if we zip the dataset and track it as a single file, DVC works fast enough. The problem is that this way we can't track changes to individual files.
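The zip workaround we tried looks roughly like this (the archive name is arbitrary); it is fast, but DVC then only sees one opaque file:

```bash
# Pack the whole dataset into a single archive and track only the archive
zip -r -q ocr_dataset.zip ocr_dataset/
dvc add ocr_dataset.zip
git add ocr_dataset.zip.dvc .gitignore
git commit -m "Track zipped OCR dataset"
dvc push   # uploads one ~200 MB object instead of ~100,000 tiny ones
```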

The goal is to have version control for a dataset with a large number of files, with the following functionality:

  1. Tracking of each individual file.
  2. Committing only the changes, not the whole dataset.
  3. Fast checkout/pull.

Any suggestion for a better solution is welcome.

user10333
  • 331
  • 1
  • 9
  • Have you tried SVN? – ElpieKay May 08 '19 at 08:38
  • No, but as far as I know version control systems are not built for this type of task; that's why we use DVC inside Git. – user10333 May 08 '19 at 10:24
  • Hi @user10333 (I'm one of the DVC creators). It's sad to hear that performance is not optimal in such a simple case. What DVC version are you using? How long does it take to add 200 MB in your case? For upload/download, is it S3 or something else? – Shcheklein May 08 '19 at 19:07
  • 1
    @user10333 I've created a ticket to investigate and improve the performance - would you mind to give us more details there? https://github.com/iterative/dvc/issues/1970 – Shcheklein May 08 '19 at 19:14
  • Hi Shcheklein, dvc add and dvc push took about 2 hours with a 30 Mb upload speed. – user10333 May 09 '19 at 09:05
  • @user10333 this is weird and looks like a bug. Would you mind providing a bit more detail in the ticket I created? We need at least the cloud provider you are using and the number of files (in the 30 Mb set). – Shcheklein May 09 '19 at 23:33
  • Hi Shcheklein, if I understood you correctly: we are using Google Cloud Storage as the remote for DVC, with a single bucket. The total number of files exceeds 100,000, the total size on disk is 229 MB, and the average file size is about 1.3 KB. Our upload speed is 30 Mb and the download speed is also 30 Mb. I checked uploading our dataset to a similar Google Storage bucket without DVC and it took about 25 minutes. – user10333 May 12 '19 at 06:48
  • 1
    DVC maintainer here. For the record: we've introduced lots of optimizations in 1.0 that improve the experience significantly. We've shared some charts in https://dvc.org/blog/dvc-1-0-release . Please give it a try :) – Ruslan Kuprieiev Aug 25 '20 at 21:08

1 Answer


> On the other hand, if we zip the dataset and track it as a single file, DVC works fast enough. The problem is that this way we can't track changes to individual files.

The zip file is the right approach; combine it with Git LFS in order to store many revisions of that zip file.
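A minimal sketch, assuming the archive is named dataset.zip:

```bash
# One-time setup per machine
git lfs install

# Let Git LFS manage the archive, then commit and push as usual
git lfs track "dataset.zip"
git add .gitattributes dataset.zip
git commit -m "Store dataset archive through Git LFS"
git push
```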

You could complement that archive with a text file listing all the images, each one with a comment describing any change made to it: that way, since the text file would be committed alongside each new revision of the archive, you would still be able to see the list and nature of the changes made to the elements of the archive.
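One way to generate the listing part automatically (a sketch only: the checksum column is an addition to the hand-written comments suggested above, and the file names are placeholders) is to rebuild a manifest before each commit, so that a plain git diff of that file shows which images changed between revisions of the archive:

```bash
# Rebuild the manifest: one line per image, "<sha1>  <path>", sorted for stable diffs
find ocr_dataset/ -type f -print0 | sort -z | xargs -0 sha1sum > manifest.txt

# Commit the manifest alongside the new revision of the archive
git add manifest.txt dataset.zip
git commit -m "Update dataset; note the changed images in the commit message"
```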

VonC
  • 1,262,500
  • 529
  • 4,410
  • 5,250
  • Worth knowing: GitHub limits the size of LFS files: https://docs.github.com/en/github/managing-large-files/about-git-large-file-storage. – Philippe Remy Jan 29 '21 at 06:25
  • @PhilippeRemy I agree. This won't work for any "large amount" of data... unless you purchase a data pack: https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/upgrading-git-large-file-storage – VonC Jan 29 '21 at 07:41