12

I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models.

One idea I can think of is using separate git submodules for data and models. But that feels cumbersome and imposes additional requirements on the end user (e.g. having to run git submodule update).

So can I do this without using git submodules?

Michael Litvin

2 Answers

13

You can first add the DVC remotes you want to use (say you call them data and models, each one pointing to a different Google Cloud bucket), but don't set either remote as the project's default. This way, dvc push won't work without the -r (or --remote) option.

You would then need to push each directory or file individually to the appropriate remote, like dvc push data/ -r data and dvc push model.dat -r models.
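The steps above can be sketched as a small shell script. This is only a sketch under assumptions: the bucket URLs gs://my-data-bucket/dvc and gs://my-models-bucket/dvc are placeholders, and the dvc_run helper with its DRY_RUN switch is a hypothetical convenience, not part of DVC. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): prints the dvc command when
# DRY_RUN=1 (the default here) and executes it otherwise.
DRY_RUN="${DRY_RUN:-1}"
dvc_run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "dvc $*"
  else
    dvc "$@"
  fi
}

# Register both remotes. No -d/--default flag is passed, so neither
# remote becomes the project default. The bucket URLs are placeholders.
setup_remotes() {
  dvc_run remote add data gs://my-data-bucket/dvc
  dvc_run remote add models gs://my-models-bucket/dvc
}

# Push each tracked path to its own remote, as described in the answer.
push_all() {
  dvc_run push data/ -r data
  dvc_run push model.dat -r models
}

setup_remotes
push_all
```

Keeping both pushes in one function means nobody has to remember which path belongs to which remote.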

Note that a feature request to configure this exists on the DVC repo too. See Specify file types that can be pushed to remote.

Jorge Orpinel Pérez
  • 1
    @michael-litvin I commented on the issue for you. See: https://github.com/iterative/dvc/issues/2095#issuecomment-556126420. Feel free to subscribe to that issue and/or participate. – Jorge Orpinel Pérez Nov 20 '19 at 17:18
  • 2
    You may wrap some bash scripts or a make file around those commands to make them less error prone. – Suor Nov 20 '19 at 19:01
  • When you have a mono-repo that hosts several projects contributed by different people, this would still be a problem, as people may accidentally push data to the wrong remote and then everything will be messed up. – link89 Nov 04 '22 at 07:23
  • Not if you isolate the projects in the monorepo. See https://dvc.org/doc/command-reference/init#initializing-dvc-in-subdirectories – Jorge Orpinel Pérez Nov 05 '22 at 16:31
7

Yes, you can use multiple remotes without Git submodules.

There is a separate command for using data artifacts from external repositories: dvc import http://your-repo datadir. The command brings the data into your repo and keeps the connection to the original repo (to avoid duplicating data across remotes).

In your case, one repository can hold the dataset with its own data remote. A second repo can hold the code and models: it imports the dataset project, while all its models and outputs go to another data remote.

With import, no dvc push -r myremote is needed. A default dvc push synchronizes the data with the proper remote.

EDITED: Simply use one Git repo for dataset with its data-remote/S3-folder, and import it from another repo with code, model and another data-remote/S3-folder.
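A rough sketch of that two-repo layout follows. The bucket URLs are placeholders, http://your-repo stands for the dataset repo's Git URL as in the command above, and the run helper with its DRY_RUN switch is a hypothetical convenience that only prints commands by default. Each function is meant to be run from inside its own Git repo that has already been initialized with dvc init.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): prints the command when
# DRY_RUN=1 (the default here) and executes it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run inside the dataset repo: it gets its own default remote.
# The resulting .dvc file still has to be committed with Git.
setup_dataset_repo() {
  run dvc remote add -d data gs://my-data-bucket/dvc
  run dvc add datadir
  run dvc push
}

# Run inside the code-and-models repo: it has a different default
# remote and imports datadir from the dataset repo.
setup_models_repo() {
  run dvc remote add -d models gs://my-models-bucket/dvc
  run dvc import http://your-repo datadir
  run dvc push
}
```

Because each repo has its own default remote, a plain dvc push in either one goes to the right bucket without the -r option.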

Dmitry Petrov
  • 1
    I made a related comment with some code examples: https://github.com/iterative/dvc/issues/2095#issuecomment-560017410 – Dmitry Petrov Nov 30 '19 at 20:24