Short version:
The .dvc file contains a hash (md5) that points to a JSON file inside the cache, and that JSON file describes the current state of the directory.
When the directory is updated, the .dvc file gets a new md5 and a new JSON file is created with the updated state of the directory.
In git, you store the .dvc file, so that DVC knows (based on the md5) where to look for information about the directory.
Longer version:
Let me break down the individual steps of directory handling with DVC.
- Let's assume you have a data directory that you want to put under DVC control:
data
├── 1
└── 2
- You run dvc add data to make DVC track your directory. As a result, DVC produces a data.dvc file. As you noted, this file contains the metadata required to connect your git repository with your data storage. Inside this file (among other things) you can see:
outs:
- md5: f437247ec66d73ba66b0ade0246fcb49.dir
  path: data
- The md5 value is the key under which the information about the directory is stored in the DVC cache (.dvc/cache):
(dvc3.7) ➜ repo$ tree .dvc/cache
.dvc/cache
├── 26
│   └── ab0db90d72e28ad0ba1e22ee510510
├── b0
│   └── 26324c6904b2a9cb4b88d6d61c81d1
└── f4
    └── 37247ec66d73ba66b0ade0246fcb49.dir
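The listing above suggests how a hash maps to a location in the cache: the first two characters become a subdirectory and the rest becomes the file name. A minimal Python sketch of that mapping, assuming the layout shown here (other DVC versions may lay out the cache differently); cache_path is a hypothetical helper written for illustration, not part of DVC's API:

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> PurePosixPath:
    """Hypothetical helper: map a hash to its location in the cache.

    Assumes the layout from the listing above: a directory named after
    the first two characters, containing a file named after the rest.
    """
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


# The directory entry referenced from data.dvc
print(cache_path("f437247ec66d73ba66b0ade0246fcb49.dir"))
# One of the regular file entries
print(cache_path("b026324c6904b2a9cb4b88d6d61c81d1"))
```

Both printed paths should match entries in the tree output above.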
- If you open the file with the .dir suffix, you will see that it contains a description of the current state of data:
(dvc3.7) ➜ repo$ cat .dvc/cache/f4/37247ec66d73ba66b0ade0246fcb49.dir
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]
As you can see, the individual files (1 and 2) are described by entries in this file.
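Since the .dir file is plain JSON, it can be read with any JSON parser. The sketch below is only an illustration of the structure shown above, not DVC's own code; read_manifest is a hypothetical name, and the manifest text is copied from the cat output instead of being read from disk:

```python
import json

# Contents of the .dir file, copied from the output above
MANIFEST = """
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
 {"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]
"""


def read_manifest(text: str) -> dict:
    """Hypothetical helper: return {relpath: md5} for every entry."""
    return {entry["relpath"]: entry["md5"] for entry in json.loads(text)}


files = read_manifest(MANIFEST)
for relpath, md5 in sorted(files.items()):
    # relpath is relative to the tracked directory (data)
    print(f"data/{relpath} -> {md5}")
```

Each relpath is relative to the tracked directory, and each md5 names another entry in the cache.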
- When you change your directory:
(dvc3.7) ➜ repo$ echo 3 >> data/3
(dvc3.7) ➜ repo$ dvc commit data.dvc
The content of data.dvc will be updated:
outs:
- md5: 12f4b7d54a32e58818e27fba28376fba.dir
  path: data
And there is a new file inside the cache:
├── 12
│   └── f4b7d54a32e58818e27fba28376fba.dir
...
(dvc3.7) ➜ repo$ cat .dvc/cache/12/f4b7d54a32e58818e27fba28376fba.dir
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
{"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]
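To make the change concrete, the two .dir files can be compared entry by entry. The following is a rough sketch under the assumption that each manifest is a JSON list of md5/relpath pairs, as in the outputs above; manifest_diff is a hypothetical helper, not a DVC function. It also checks that the md5 recorded for file 3 equals the md5 of its content ("3" plus a newline, as written by echo), which I am assuming holds for a small text file like this one:

```python
import hashlib
import json

# Manifest before the change (f437...dir) and after it (12f4...dir),
# both copied from the cat outputs above
OLD = """[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]"""
NEW = """[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
{"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]"""


def manifest_diff(old_text: str, new_text: str):
    """Hypothetical helper: return (added, removed, modified) relpaths."""
    old = {e["relpath"]: e["md5"] for e in json.loads(old_text)}
    new = {e["relpath"]: e["md5"] for e in json.loads(new_text)}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, modified


added, removed, modified = manifest_diff(OLD, NEW)
print("added:", added, "removed:", removed, "modified:", modified)

# md5 of the bytes that `echo 3` writes into a new file
content_md5 = hashlib.md5(b"3\n").hexdigest()
print("md5 of new file content:", content_md5)
```

Only the entry for 3 is new; the entries for 1 and 2 are identical in both manifests, so their cached copies are shared between the two directory states.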
From the perspective of git, the only change is inside data.dvc (assuming you ran git commit after adding data with 1 and 2 inside):
diff --git a/data.dvc b/data.dvc
index 098aec5..88d1a90 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,6 +1,6 @@
-md5: a427c5bf8680fbf8d1951806b28b82fe
+md5: 1b674d61c195eea7a6b14f176c020b9c
 outs:
-- md5: f437247ec66d73ba66b0ade0246fcb49.dir
+- md5: 12f4b7d54a32e58818e27fba28376fba.dir
   path: data
   cache: true
   metric: false
NOTE: The first md5 is the checksum of the data.dvc file itself, so it had to change when the directory md5 changed.