DVC uses git commits to save the experiments and navigate between experiments.

Is it possible to avoid making auto-commits in CI/CD (for example, to save data artifacts after running dvc repro on the CI/CD side)?

Dmitry Petrov

1 Answer

> will you make it part of CI pipeline

DVC often serves as a part of MLOps infrastructure. There is a popular blog post about CI/CD for ML where DVC is used under the hood, and another example that does the same with GitLab CI/CD.

> scenario where you will integrate dvc commit command with CI pipelines?

If you mean a Git commit of dvc-files (not the dvc commit command), then yes: dvc-files need to be committed to Git during the CI/CD process. However, auto-committing from CI/CD is not the best practice.

How to avoid Git commit in CI/CD:

  1. After ML model training in CI/CD, save the changed dvc-files to external storage (for example, GitLab artifacts/releases), then download the files to a developer machine and commit them there. Users usually write scripts to automate this.
  2. Wait for the DVC 1.0 release, in which run-cache (similar to a build cache) will be implemented. Run-cache makes dvc-files ephemeral, so no additional Git commits will be required. Technically, run-cache is an associative storage that maps repo state --> run results and lives outside of the Git repo (in the data remote).
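
To make the "associative storage" idea in option 2 more concrete, here is a toy model in Python. It is only a sketch of the concept, not DVC's actual implementation: the names `stage_key` and `ToyRunCache`, the hashing scheme, and the short hash strings are all made up for illustration. The point is that a stage's command plus the hashes of its dependencies form a key, and the recorded outputs are the value, so results can be looked up without any Git commit.

```python
import hashlib
import json


def stage_key(command, dep_hashes):
    """Build a lookup key from a stage's command and its dependency hashes.

    Hypothetical scheme for illustration; DVC's real key format may differ.
    """
    payload = json.dumps({"cmd": command, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyRunCache:
    """In-memory stand-in for a run-cache: stage state -> recorded outputs."""

    def __init__(self):
        self._results = {}

    def save(self, command, dep_hashes, out_hashes):
        # Record the outputs produced by running `command` on these inputs.
        self._results[stage_key(command, dep_hashes)] = dict(out_hashes)

    def lookup(self, command, dep_hashes):
        # Return the recorded outputs, or None if this state was never run.
        return self._results.get(stage_key(command, dep_hashes))


cache = ToyRunCache()
deps = {"train.py": "a1", "data.csv": "b2"}  # made-up content hashes

# "CI side": the stage ran, so its results are recorded under the state key.
cache.save("python train.py", deps, {"model.pkl": "c3"})

# "Developer side": same code and data give a hit; changed data gives a miss.
hit = cache.lookup("python train.py", deps)
miss = cache.lookup("python train.py", {"train.py": "a1", "data.csv": "changed"})
```

Because the key depends only on the command and its inputs, two machines with the same code and data resolve to the same entry, which is why no dvc-file has to travel through Git.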

Disclaimer: I'm one of the creators of DVC.

Dmitry Petrov
  • `Run-cache makes dvc-files ephemeral and no additional Git commits will be required` @dmitry-petrov I wanted some clarification: - With DVC 1.0, is it not required to commit dvc-files into Git? - Will dvc-files be tracked in run-cache? - For a given Git commit, how will you map which DVC results should be fetched from the cache? - Will the experiment parameters/code in the current Git commit be enough to fetch the appropriate DVC results from the cache? - I still have some doubts about how only changed stages are run with DVC 1.0, but for simplicity, say I have only 1 stage in the pipeline. – B.P.Puneeth Pai Apr 20 '20 at 04:04
  • @B.P.PuneethPai dvc-files will still be there and can be committed/frozen to Git. At the same time, run-cache tracks the "ephemeral" state (but per stage, not per dvc-file). A user can apply the ephemeral state to a dvc-file (with `dvc repro`) and commit it to Git if needed. So, you can bring my training results to your machine w/o sharing a commit. Or you can take results from CI/CD w/o a commit. – Dmitry Petrov Apr 21 '20 at 00:22
  • @B.P.PuneethPai This is a very good question that people frequently ask... I'd really appreciate it if you could modify it and resolve the "opinion-based" issue. To rephrase the question, you can ask about Git commit directly. – Dmitry Petrov Apr 21 '20 at 00:37
  • Just an update. DVC 1.0 is released - https://dvc.org/blog/dvc-1-0-release Now you can avoid auto-commits on the CI/CD side by using the run-cache - `dvc push --run-cache` and `dvc pull --run-cache`. – Dmitry Petrov Jun 25 '20 at 16:20
  • This is great. I have been thinking about how to use DVC together with a GitLab CI/CD yml. It'd be great if there were a chapter in the tutorial on this. – C. Feng Oct 15 '20 at 19:12
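
For readers landing here, the `dvc push --run-cache` / `dvc pull --run-cache` flow from the comments could be scripted roughly as below. This is a sketch under assumptions, not an official recipe: the helper `run_steps`, the step lists, and the plain `dvc pull` at the start of the CI job are illustrative choices, and the developer side assumes the same code revision is checked out as in CI. By default the helper only reports the commands (dry run), so it can be tried without DVC installed.

```python
import subprocess

# CI/CD side: reproduce the pipeline, then upload the run-cache
# instead of committing dvc-files to Git.
CI_STEPS = [
    ["dvc", "pull"],                 # assumption: input data lives in the DVC remote
    ["dvc", "repro"],
    ["dvc", "push", "--run-cache"],
]

# Developer side (same code checked out): download the run-cache, then let
# `dvc repro` restore the recorded results instead of re-running the stages.
DEV_STEPS = [
    ["dvc", "pull", "--run-cache"],
    ["dvc", "repro"],
]


def run_steps(steps, dry_run=True):
    """Run each command in order; with dry_run=True only report them."""
    reported = []
    for step in steps:
        if not dry_run:
            # check=True stops the job at the first failing command.
            subprocess.run(step, check=True)
        reported.append(" ".join(step))
    return reported


ci_plan = run_steps(CI_STEPS)    # dry run: nothing is executed
dev_plan = run_steps(DEV_STEPS)  # dry run: nothing is executed
```

Calling `run_steps(CI_STEPS, dry_run=False)` inside a CI job would execute the commands for real; any commit of the resulting dvc-files can then happen later, on a developer machine, if it is needed at all.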