
Can someone explain how DVC stores differences at the directory level in the DVC cache?

I understand that DVC-files (.dvc) are metafiles used to track data and models and to reproduce pipeline stages. However, it is not clear to me how the process of creating branches, committing to them, and switching back to master is actually saved as differences.

mikus
mkhlr

1 Answer


Short version:

  1. The .dvc file contains the md5 of a JSON file inside the cache that describes the current state of the directory.

  2. When the directory is updated, the .dvc file gets a new md5, and a new JSON file describing the updated state of the directory is created.

  3. In git, you store the .dvc file, so that DVC knows (based on the md5) where to look for information about the directory.

Longer version:

Let me break down the individual steps of directory handling in DVC.

  • Let's assume you have a data directory that you want to put under DVC control:
data
├── 1
└── 2
  • You run dvc add data to make DVC track your directory. As a result, DVC produces a data.dvc file. As you noted, this file contains the metadata required to connect your git repository with your data storage. Inside this file (among other things) you can see:
outs:
- md5: f437247ec66d73ba66b0ade0246fcb49.dir
  path: data
  • The md5 value determines where the information about the directory is stored in the DVC cache (.dvc/cache):
(dvc3.7) ➜  repo$ tree .dvc/cache
.dvc/cache
├── 26
│   └── ab0db90d72e28ad0ba1e22ee510510
├── b0
│   └── 26324c6904b2a9cb4b88d6d61c81d1
└── f4
    └── 37247ec66d73ba66b0ade0246fcb49.dir
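
The tree above suggests a simple layout: the first two characters of the md5 become a subdirectory and the rest becomes the file name. A minimal sketch of that mapping follows; `cache_path` is a hypothetical helper written for illustration, not a DVC function, and it only reproduces the layout visible in this example.

```python
import posixpath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a checksum (optionally ending in '.dir') to a path inside the cache.

    Hypothetical helper: it mirrors the layout shown in the tree above,
    where the first two hex characters form the subdirectory name.
    """
    return posixpath.join(cache_dir, md5[:2], md5[2:])


# The directory entry from data.dvc and one of the file entries.
dir_entry = cache_path("f437247ec66d73ba66b0ade0246fcb49.dir")
file_entry = cache_path("b026324c6904b2a9cb4b88d6d61c81d1")
print(dir_entry)
print(file_entry)
```

Both printed paths correspond to entries in the tree listing above.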

  • If you open the file with the .dir suffix, you will see that it contains a description of the current state of data:
(dvc3.7) ➜  repo$ cat .dvc/cache/f4/37247ec66d73ba66b0ade0246fcb49.dir 
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
 {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]

As you can see, the individual files (1 and 2) are each described by an entry in this file.
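
To make the structure of that listing concrete, here is a rough sketch that builds the same kind of md5/relpath list for a directory. This is not DVC's own code: `file_md5`, `build_manifest` and `demo` are hypothetical names, and the demo assumes that files 1 and 2 each contain their own name followed by a newline, which the original post does not state. How DVC derives the md5 of the .dir file itself is not covered here.

```python
import hashlib
import json
import os
import tempfile


def file_md5(path):
    """Return the hex md5 of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """List every file under root as {"md5": ..., "relpath": ...}, sorted by relpath."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    return sorted(entries, key=lambda e: e["relpath"])


def demo():
    """Build a manifest for a throwaway directory holding files 1 and 2."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("1", "2"):
            # Assumed contents: the file name followed by a newline.
            with open(os.path.join(root, name), "wb") as f:
                f.write(name.encode() + b"\n")
        return build_manifest(root)


manifest = demo()
print(json.dumps(manifest, indent=1))
```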

  • When you change the contents of the directory:
(dvc3.7) ➜  repo$ echo 3 >> data/3 
(dvc3.7) ➜  repo$ dvc commit data.dvc

The content of data.dvc will be updated:

outs:
- md5: 12f4b7d54a32e58818e27fba28376fba.dir
  path: data

And there is a new file inside the cache:

├── 12
│   └── f4b7d54a32e58818e27fba28376fba.dir
...

(dvc3.7) ➜  repo$ cat .dvc/cache/12/f4b7d54a32e58818e27fba28376fba.dir 
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
 {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
 {"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]
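
Each version of the directory gets its own complete listing, so no delta is stored. The difference between two versions can still be recovered by comparing the listings. The sketch below does that for the two .dir files shown above; `diff_manifests` is a hypothetical helper, not part of DVC. Files whose md5 is unchanged presumably point at the same cache entries, so only the new file should need a new object in the cache.

```python
import json

# The two directory listings, copied from the .dir files above.
OLD = json.loads("""
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
 {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]
""")
NEW = json.loads("""
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
 {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
 {"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]
""")


def diff_manifests(old, new):
    """Compare two listings by relpath and md5 (hypothetical helper)."""
    old_map = {e["relpath"]: e["md5"] for e in old}
    new_map = {e["relpath"]: e["md5"] for e in new}
    return {
        "added": sorted(p for p in new_map if p not in old_map),
        "removed": sorted(p for p in old_map if p not in new_map),
        "modified": sorted(
            p for p in new_map if p in old_map and new_map[p] != old_map[p]
        ),
        # Checksums that appear only in the new listing.
        "new_objects": sorted(set(new_map.values()) - set(old_map.values())),
    }


result = diff_manifests(OLD, NEW)
print(result)
```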

From the perspective of git, the only change is inside data.dvc (assuming you ran git commit after adding data with 1 and 2 inside):

diff --git a/data.dvc b/data.dvc
index 098aec5..88d1a90 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,6 +1,6 @@
-md5: a427c5bf8680fbf8d1951806b28b82fe
+md5: 1b674d61c195eea7a6b14f176c020b9c
 outs:
-- md5: f437247ec66d73ba66b0ade0246fcb49.dir
+- md5: 12f4b7d54a32e58818e27fba28376fba.dir
   path: data
   cache: true
   metric: false

NOTE: The first md5 is the checksum of the data.dvc file itself, so it had to change when the directory md5 changed.
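
Since git only versions data.dvc, switching branches amounts to swapping which directory md5 that file holds. As a rough illustration, the sketch below pulls the `outs` checksum out of the new version of data.dvc. It assumes the file looks exactly like the one in the diff above. `out_checksums` is a hypothetical helper that uses a regular expression rather than a YAML parser to stay within the standard library, and it is not how DVC reads these files.

```python
import re

# The post-commit version of data.dvc, reconstructed from the diff above.
DVC_FILE = """\
md5: 1b674d61c195eea7a6b14f176c020b9c
outs:
- md5: 12f4b7d54a32e58818e27fba28376fba.dir
  path: data
  cache: true
  metric: false
"""


def out_checksums(dvc_text):
    """Return the md5 values of list items ("- md5: ..."), skipping the top-level md5."""
    return re.findall(r"^-\s+md5:\s*(\S+)\s*$", dvc_text, flags=re.MULTILINE)


checksums = out_checksums(DVC_FILE)
print(checksums)
```

The returned value is the name under which the matching .dir listing is kept in the cache.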

don_pablito
    Also, check this question https://stackoverflow.com/questions/60365473/by-how-much-can-i-approx-reduce-disk-volume-by-using-dvc/60366262#60366262 - it gives another explanation of how DVC handles files. – Shcheklein Mar 04 '20 at 17:37