
I have been running experiments in AWS. All my data and models are in S3, in the same bucket under different prefixes (folders). I don't necessarily need to download the models, since they are hosted in AWS. Based on the docs at https://dvc.org/, I can set up remote storage and an external cache. Do I have to add each file individually, e.g. `dvc add --external s3://mybucket/existing-data/file1`, then the same for file2 and so on?

Similarly, the outputs are in another prefix. Once training is done, I know the exact S3 path where the model is written. All the model output `.pkl` files are stored in the same directory under different names. I can add one with `dvc add --external s3://mybucket/models/modelone.pkl`. Do I need to do this for each file, or can I add the entire directory `s3://mybucket/models`?

dvc remote add s3cache s3://mybucket/cache
dvc config cache.s3 s3cache

dvc add --external s3://mybucket/existing-data

Once the models are added, how can I version them? If I need to go back to a previous version, how do I assign a version to a model and download that version later?

arve
  • Hey, @arve. `--external` is not supported by `dvc import`/`get`/`ls`/etc., so you won't be able to download a particular version other than with `dvc checkout`, which restores a particular version (from the corresponding DVC file) directly on S3. Could you elaborate on your scenario, please? Is your pipeline managed by DVC as well? – Ruslan Kuprieiev Apr 02 '23 at 17:05
  • @RuslanKuprieiev - Thanks. I'm not managing pipelines with DVC yet; I just want to track inputs and outputs for now. I run my training on AWS SageMaker. I create a branch and run `dvc add --external s3://data` to track my data. Once training is done, I add the model's output path with `dvc add --external s3://data/output/model.tar`. I'm also looking at GTO to add more metadata and versions to my model outputs, but didn't find good examples. Since I have set up the cache at s3://data/cache, both my inputs and outputs should be cached too, right? I want to be able to track inputs and version model outputs. – arve Apr 03 '23 at 00:51
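
Following the `dvc checkout` suggestion in the comments, one possible way to attach a version to an externally tracked S3 path is to commit its DVC file and mark the commit with a Git tag. The sketch below is untested against a real bucket and rests on several assumptions: the helper names (`run`, `snapshot_external`, `restore_external`), the `DRY_RUN` switch, and the tag name are all hypothetical, and it assumes `dvc add --external` writes a file named after the last path component (e.g. `models.dvc`) in the current directory, in a DVC release that still supports `--external`.

```shell
#!/bin/sh
# Sketch only: bucket, file, and tag names are placeholders.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Track an S3 object or prefix in place, then record that state as a Git tag.
snapshot_external() {
    s3_path=$1                               # e.g. s3://mybucket/models
    tag=$2                                   # e.g. models-v1 (hypothetical)
    dvc_file="$(basename "$s3_path").dvc"    # assumed name of the DVC file
    run dvc add --external "$s3_path"
    run git add "$dvc_file"
    run git commit -m "Track $s3_path as $tag"
    run git tag "$tag"
}

# Bring back the DVC file from an earlier tag and let DVC restore
# the matching content directly on S3.
restore_external() {
    tag=$1
    dvc_file=$2
    run git checkout "$tag" -- "$dvc_file"
    run dvc checkout "$dvc_file"
}
```

For example, `snapshot_external s3://mybucket/models models-v1` after a training run, and later `restore_external models-v1 models.dvc` to go back. Running with `DRY_RUN=1` first shows the commands without touching Git or S3.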
