5

I'm trying to understand DVC. Most tutorials mention generating dvc.yaml by running the dvc run command.

At the same time, dvc.yaml, which defines the DAG, is also well documented. The fact that it is YAML, and so human-readable and human-writable, suggests that it is meant to be a DSL for specifying your data pipeline.

Can somebody clarify which is the better practice: writing dvc.yaml by hand, or letting it be generated by the dvc run command? Or is it left to the user's choice, with no technical difference?

rajeshnair

2 Answers

4

Both, really.

Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most users, including casual ones, this is probably the easiest and thus the best option. The format is guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).

However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can edit the YAML manually (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).
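
To make the comparison concrete, here is a rough sketch of what the two routes produce for one simple stage. The stage name `train` and the paths `train.py`, `data/features` and `model.pkl` are made up for illustration. The command in the comment relies on the `-n`, `-d` and `-o` options of `dvc stage add`, and the exact ordering of the generated entries may differ.

```yaml
# Hypothetical stage. Roughly what you would expect in dvc.yaml after running:
#   dvc stage add -n train -d train.py -d data/features -o model.pkl \
#       python train.py data/features model.pkl
# Typing the same block by hand should give an equivalent pipeline.
stages:
  train:
    cmd: python train.py data/features model.pkl
    deps:
      - train.py
      - data/features
    outs:
      - model.pkl
```

Either way, the file is the single source of truth that dvc repro reads, so the choice is mostly about who does the typing.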

casper.dcl
  • 1
    "we are encouraging users to write dvc.yaml (rather) than use dvc run" from https://github.com/iterative/dvc/issues/5180#issuecomment-762643966 (dvc.yaml 2.0 developer) – Jorge Orpinel Pérez Jun 18 '21 at 21:29
  • 1
    @JorgeOrpinel I think that's an exceptionally misquoted out of context extract. The dvc.yaml recommendation there was stated specifically as a) one justification for not yet bothering to implement `${}` support, and b) better than `dvc run` purely due to its anticipated deprecation but NOT better than `dvc stage add`. Please do correct me if I'm wrong. – casper.dcl Jun 19 '21 at 01:33
4

I'd recommend manual editing as the main route! (I believe that's officially recommended since DVC 2.0)

dvc stage add can still be very helpful for programmatic generation of pipeline files, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.
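
As a sketch of those hand-written features, a dvc.yaml that uses `vars` and `foreach` could look something like this. The `data_dir` variable, the item names and the `clean.py` script are hypothetical, chosen only to show the shape of the file.

```yaml
# Hypothetical pipeline: one templated stage expanded over a list of items.
vars:
  - data_dir: data            # assumed variable, referenced below as ${data_dir}

stages:
  clean:
    foreach:                  # one stage is generated per list entry
      - raw1
      - labels1
    do:
      cmd: python clean.py ${data_dir}/${item}
      deps:
        - clean.py
        - ${data_dir}/${item}
      outs:
        - ${data_dir}/${item}.cln
```

With a file like this, the generated stages should be addressable as clean@raw1 and clean@labels1.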

Jorge Orpinel Pérez
  • Thanks for the answer. Can you point me to official DVC documentation that recommends manual editing? That would settle the question for me – rajeshnair Jun 16 '21 at 19:57
  • 1
    I don't think it's expressed in the docs, but I remember DVC team discussions where this was mentioned. In the end it's up to you, of course. Whatever works for you, your team, your workflow, etc. – Jorge Orpinel Pérez Jun 16 '21 at 20:38
  • `stage add` won't let you make silly mistakes, while manually editing will. I don't see any way manually editing will ever become the official de-facto method. About missing e.g. `vars` and `foreach`: these are firstly advanced features and secondly should be supported by `stage add` in future. – casper.dcl Jun 17 '21 at 09:49
  • @casper.dcl, yes, that's my concern as well! I could not find anywhere in the docs which of the two (the dvc.yaml DSL or dvc run) is guaranteed to be future-proof. Reading the docs, it looks like DVC has decided to keep both options open. – rajeshnair Jun 17 '21 at 20:00
  • 1
    I found this: "Note, we use dvc stage add command instead of dvc run. Starting from DVC 2.0 we begin extracting all stage specific functionality under dvc stage umbrella. dvc run is still working, but will be deprecated in the following major DVC version (most likely in 3.0)." So dvc run will be deprecated in 3.0 and replaced with dvc stage add, but for using variables you still need to edit dvc.yaml manually – rajeshnair Jun 17 '21 at 20:04
  • 1
    `dvc repro` also prevents mistakes as it validates dvc.yaml when it's run. DVC wants to encourage users to manually edit dvc.yaml which is why it's so thoroughly documented (there's even a JSON schema file). `stage add` is kept around for simple use cases and backward compatibility I think, but is optional. And I doubt there are plans to support advanced dvc.yaml features via CLI. See https://github.com/iterative/dvc/issues/5180#issuecomment-752321591 (developer of dvc.yaml 2.0). – Jorge Orpinel Pérez Jun 18 '21 at 22:10
  • 1
    Having read @JorgeOrpinel's comments in the discussions, I would tend to change my response and mark his answer as the right one. I understand now that there is only a fine distinction between `dvc stage add` and manually editing dvc.yaml, but given that the original authors have expressed this in the issues, I believe that will be the future – rajeshnair Jun 20 '21 at 21:15