I am very interested in reproducible data science work. To that end, I am now exploring Docker as a platform for bundling code, data, and environment settings. My first simple attempt is a Docker image which contains the data it needs (link).
However, this is only a first step: in this example the data is part of the image, so when the image is run as a container the data is already there. My next objective is to decouple the analysis code from the data. As far as I understand, that would mean having two containers, one with the code (code) and one with the data (data).
For the code image I use a simple Dockerfile:
FROM continuumio/miniconda3
# install IPython non-interactively so the build does not stop at a prompt
RUN conda install -y ipython
and for the data image:
FROM atlassian/ubuntu-minimal
# bake the dataset into the image
COPY data.csv /tmp
where data.csv is the data file I'm copying into the image.
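I build and tag the two images roughly like this (the image names are the ones I use below; the directory layout is just how I happen to organize things):

# each directory holds one of the Dockerfiles above, plus data.csv for the data image
docker build -t drorata/minimal-python ./code
docker build -t drorata/data-image ./data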
After building these two images, I first create a network:

docker network create data-testing

and then run the containers as described in this solution:

docker run -i -t --name code --net=data-testing --net-alias=code drorata/minimal-python /bin/bash
docker run -i -t --name data --net=data-testing --net-alias=data drorata/data-image /bin/bash
After these steps I can ping one container from the other, and presumably also access data.csv from the code container somehow. But I have the feeling that this is a suboptimal solution and cannot be considered good practice.
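To make that concrete, the only way I can see to actually read the file over that network is to serve it from the data container, e.g. (this is only an illustration and assumes python3 is available in the data image; neither Dockerfile above sets up a server):

# inside the data container: serve /tmp, where data.csv lives, over HTTP
cd /tmp && python3 -m http.server 8000

# inside the code container: fetch the file via the network alias "data"
python -c "import urllib.request; urllib.request.urlretrieve('http://data:8000/data.csv', '/tmp/data.csv')"

which feels like a lot of machinery just to read a CSV file.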
What is considered good practice for giving a container access to data? I have read a little about data volumes, but I don't understand how to use them and how (or whether) they can be turned into images.
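For reference, this is roughly what I have pieced together about named volumes so far (the volume name data-vol and the mount point /data are placeholders I made up); I am not sure whether this pattern should replace the data image or complement it:

# create a named volume and populate it from the data image
docker volume create data-vol
docker run --rm -v data-vol:/data drorata/data-image cp /tmp/data.csv /data/

# run the analysis container with the same volume mounted
docker run -i -t -v data-vol:/data drorata/minimal-python /bin/bash

but I don't know if this is what good practice looks like here.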