Questions tagged [dvc]

Data Version Control (DVC) is an open-source version control system for ML and data science projects. Use this tag for questions related to DVC usage and workflows.

138 questions
34 votes, 2 answers

Difference between git-lfs and dvc

What is the difference between these two? We used git-lfs at my previous job and are starting to use dvc alongside git at my current one. Both place some kind of index in the repository instead of the file, and the file can be downloaded on demand. Has dvc some…
12 votes, 2 answers

How to use different remotes for different folders?

I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models. One idea I can think of is using separate git submodules for data and models. But that…
Michael Litvin
11 votes, 1 answer

Version control for machine learning data set with large amount of images?

We are starting to use dvc with git to version our machine learning projects. For dvc remote storage we use Google Cloud Storage. Our dataset is an OCR dataset with more than 100,000 small images; the total size is about 200 MB. Using dvc to…
user10333
8 votes, 2 answers

dvc (data version control) error - ImportError: cannot import name 'fsspec_loop' from 'fsspec.asyn'

I use Python 3.7.13 and created a virtual environment (venv) for an MLOps project. The dvc package (version 2.10.2), which is compatible with Python 3.7.13, is installed in this venv. (venv) (base) tony3@Tonys-MacBook-Pro mlops % dvc…
Tony Peng
8 votes, 1 answer

By how much can i approx. reduce disk volume by using dvc?

I want to classify ~1M+ documents and have a version control system for the input and output of the corresponding model. The data changes over time: the sample size increases over time, new features might appear, the anonymization procedure might change over…
Tlatwork
7 votes, 2 answers

Is it possible to check that the version of a file tracked by a DVC metadata file exists in remote storage without pulling the file?

My team has a setup in which we track datasets and models in DVC, and we have a GitLab repository for tracking our code and DVC metadata files. We have a job in our dev GitLab pipeline (run on each push to a merge request) that has the goal of checking…
emccords
7 votes, 1 answer

Installation DVC on MinIO storage

Has anybody set up DVC with MinIO storage? I have read the docs, but not everything is clear to me. Which command should I use to set up MinIO storage with these input parameters: storage url: https://minio.mysite.com/minio/bucket-name/ login:…
7 votes, 1 answer

Is it necessary to commit DVC files from our CI pipelines?

DVC uses git commits to save experiments and to navigate between them. Is it possible to avoid making auto-commits in CI/CD (to save data artifacts after dvc repro on the CI/CD side)?
6 votes, 2 answers

How to execute python from conda environment by dvc run

I have a conda environment configured with Python 3.6, and dvc is installed there, but when I try to execute dvc run with python, dvc calls the Python of the main conda installation and does not find the installed libraries. $ conda activate…
Heros
5 votes, 2 answers

Is dvc.yaml supposed to be written or generated by dvc run command?

I am trying to understand dvc. Most tutorials mention generating dvc.yaml by running the dvc run command. But at the same time, dvc.yaml, which defines the DAG, is also well documented. Also the fact that it is a yaml format and human readable/writable…
rajeshnair
5 votes, 1 answer

Git bash command prompt hanging when running dvc push to DAGsHub

I'm having problems pushing files with DVC to DAGsHub. Workflow: I used my email to sign up to DAGsHub. I created a repo and cloned it to my computer. I added files to the repo and tracked them using DVC, with Git tracking the pointer files. Running DVC…
5 votes, 1 answer

How does DVC store differences on the directory level into DVC cache?

Can someone explain how DVC stores differences at the directory level in the DVC cache? I understand that the DVC-files (.dvc) are metafiles to track data, models and reproduce pipeline stages. However, it is not clear to me how the process of…
mkhlr
5 votes, 0 answers

Azure DataLake with DVC

We are thinking of using DVC to version input data for a data science project. My data resides in Azure Data Lake Gen1. How do I configure DVC to push data to Azure Data Lake using a Service Principal? I want DVC to store cache and data in Azure…
Radhi
4 votes, 2 answers

ERROR: bad DVC file name 'Training_Batch_Files\Wafer12_20012.csv.dvc' is git-ignored

I am getting the error "ERROR: bad DVC file name 'Training_Batch_Files\Wafer12_20012.csv.dvc' is git-ignored." while trying to add local files for tracking. Python version: 3.7. Library used: pip install dvc; pip install dvc[gdrive]; dvc init; dvc add…
Dibyaranjan Jena
4 votes, 1 answer

Problem running a Docker container in Gitlab CI/CD

I am trying to build and run my Docker image using Gitlab CI/CD, but there is one issue I can't fix even though locally everything works well. Here's my Dockerfile: FROM RUN apt update && \ apt install…
Don Draper