We have started using DVC with Git to version our machine learning projects. For DVC remote storage we use Google Cloud Storage.
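For reference, our setup looks roughly like this (the remote name and bucket path below are placeholders):

```
# initialize DVC inside the existing Git repository
dvc init
git commit -m "Initialize DVC"

# configure a Google Cloud Storage bucket as the default remote
dvc remote add -d gcs_remote gs://our-bucket/dvc-store
git commit .dvc/config -m "Configure GCS remote"
```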
Our dataset is an OCR dataset with more than 100,000 small images, about 200 MB in total. Using DVC to track this dataset, we ran into the following problems:
- Adding the dataset for tracking takes a long time.
- Very slow upload.
- Very slow download.
- Updating, deleting, or adding just one image causes DVC to recompute a lot of things (hashes, etc.); see the sketch after this list.
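This is roughly the workflow that hits the problems above (paths are placeholders); even a one-image change goes through a full re-add and push:

```
# track the image directory (slow on the first add and after every change)
dvc add data/ocr_images
git commit data/ocr_images.dvc .gitignore -m "Track OCR dataset"
dvc push                      # uploads ~100k small objects to the GCS remote

# later: replace a single image
cp ~/new_scan.png data/ocr_images/scan_00042.png
dvc add data/ocr_images       # DVC re-scans the whole directory; in our case this is slow
git commit data/ocr_images.dvc -m "Update one image"
dvc push
```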
On the other hand, if we zip the dataset and track it as a single file, DVC works fast enough. The problem is that this way we can't track changes to individual files.
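The archive workaround looks roughly like this (file names are placeholders); it is fast for us, but DVC only sees one opaque blob:

```
# pack the dataset into a single archive and track that instead
zip -rq ocr_images.zip data/ocr_images
dvc add ocr_images.zip
git commit ocr_images.zip.dvc .gitignore -m "Track dataset as a single archive"
dvc push    # one large object instead of ~100k small ones

# downside: changing any single image produces a completely new archive,
# so per-file history is lost and the whole archive is re-uploaded
```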
The goal is version control for a dataset with a large number of files, with the following functionality:
- Tracking of each individual file.
- Committing only the changes, not the whole dataset.
- Fast checkout/pull.
Any suggestion for a better solution is welcome.