
I am very interested in reproducible data science work. To that end, I am now exploring Docker as a platform for bundling code, data, and environment settings together. My first simple attempt is a Docker image which contains the data it needs (link).

However, this is only the first step: in this example the data is part of the image, so when the image is run as a container the data is already there. My next objective is to decouple the analysis code from the data. As far as I understand, that would mean having two containers, one with the code (code) and one with the data (data).

For the code I use a simple Dockerfile:

FROM continuumio/miniconda3 
RUN conda install ipython

and for the data:

FROM atlassian/ubuntu-minimal
COPY data.csv /tmp

where data.csv is a data file I'm copying into the image.
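For completeness, this is roughly how I build the two images (the tags match the ones used in the docker run commands below; each command is run from the directory containing the corresponding Dockerfile):

```shell
# Build the code image
docker build -t drorata/minimal-python .

# Build the data image (data.csv must sit next to its Dockerfile)
docker build -t drorata/data-image .
```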

After building these two images I can run them as described in this solution:

docker run -i -t --name code --net=data-testing --net-alias=code drorata/minimal-python /bin/bash
docker run -i -t --name data --net=data-testing --net-alias=data drorata/data-image /bin/bash

after first creating a network: docker network create data-testing

After these steps I can ping one container from the other, and presumably also access data.csv from code somehow. But I have the feeling this is a suboptimal solution and cannot be considered good practice.

What is considered good practice for letting a container access data? I read a little about data volumes, but I don't understand how to use them or how to turn them into images.

– Dror

1 Answer


The use of a container as data storage (a "data-only container") is largely considered outdated and deprecated at this point. You should be using data volumes instead.

But a data volume is not something that you can turn into an image, and really, there is no need for that.
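For reference, here is a minimal sketch of how a named data volume is created and shared between containers (the volume name `research-data` and the use of the `alpine` image are made up for illustration):

```shell
# Create a named volume managed by Docker
docker volume create research-data

# Populate it: mount the volume plus the current directory into a
# throwaway container and copy the file into the volume
docker run --rm -v research-data:/data -v "$PWD":/src alpine \
    cp /src/data.csv /data/

# Any later container can mount the same volume and see the file
docker run --rm -v research-data:/data alpine ls /data
```

The volume lives on the Docker host, independent of any container, which is exactly why it does not need to be turned into an image.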

If you want to deliver a .csv file to someone and let them use it in their Docker container, just give them the .csv file.

The easiest way to get the file into the container so it can be used is with a host-mounted volume (a bind mount).

Using the -v flag on docker run, you can specify a local folder or file to be mounted into the container.

Say, for example, your Docker image expects to find a file at /data/input.csv. When you call docker run and want to provide your own input.csv file, you would do something like

docker run -v /my/file/path/input.csv:/data/input.csv my-image

I'm not including all of the options from your example here; I'm only illustrating the -v flag. Note that when you mount a single file, the container-side path must also be a file path (here /data/input.csv), not just a directory. This takes your local filesystem's input.csv and mounts it into the container, so the container can use your copy of that data.
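One small variation worth knowing, using the same hypothetical paths: appending :ro makes the bind mount read-only, so the container cannot accidentally modify your local data file:

```shell
# Same bind mount, but read-only: the container can read the CSV,
# while any attempt to write to it from inside the container fails
docker run -v /my/file/path/input.csv:/data/input.csv:ro my-image
```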

– Derick Bailey
  • And what about copying the CSV into the `code` image using the `Dockerfile`? I'm trying to simplify the sharing of research, and to that end you need to provide both the code and the data. One interesting approach is to provide them independently, in two different images, and this is what I'm trying to make work. – Dror Mar 21 '17 at 20:28
  • For your initial distribution, copy the file into the image. For any updates and distribution of data sets, that's where my answer comes in. There's no need to distribute a second image for data; just give them the data, as I suggested. – Derick Bailey Mar 22 '17 at 00:41
  • Is it possible to have some startup hook which is invoked once the container is run? This hook should pull the data from a repository. – Dror Mar 22 '17 at 05:07
  • Yes, you can specify any command you want in the Dockerfile to run when the container starts. If you use a `CMD` instruction, you can run your own `my-script.sh` script file, and in that file you can run any instructions you wish (btw: this should probably be a separate question on Stack Overflow... related to the original here, but distinct). – Derick Bailey Mar 22 '17 at 15:29
  • Correct me if I'm wrong: `my-script.sh` should be added to the image in advance so it can be `CMD`ed, right? – Dror Mar 28 '17 at 07:44
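The startup-hook idea discussed in the comments could be sketched roughly like this; the script name, data URL, and paths are all hypothetical, and the script must indeed be copied into the image at build time:

```dockerfile
FROM continuumio/miniconda3
RUN conda install ipython
# The hook script is added to the image in advance, as noted above
COPY my-script.sh /usr/local/bin/my-script.sh
RUN chmod +x /usr/local/bin/my-script.sh
CMD ["/usr/local/bin/my-script.sh"]
```

with a `my-script.sh` along these lines:

```shell
#!/bin/sh
# Hypothetical startup hook: fetch the data set, then hand over
# to an interactive shell for the analysis
curl -fsSL https://example.com/data.csv -o /tmp/data.csv
exec /bin/bash
```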