How can we synchronize notebooks between a Jupyter service and other services (Google Cloud Storage or a git repository)?

Some background on this question:

Currently I am moving from Google's Datalab to my own container. My motivation is to have more control over the data region (the Datalab beta is only offered in the US) and over packages, as I want to use the current TensorFlow version.

Based on the ideas by Google (see GitHub), I built my own Docker image and run it on my Kubernetes cluster in Google Container Engine. The GCP package can be installed as I have previously explained. Google uses a Node.js server to sync git with the Datalab instance; however, I was not able to get this running with a self-deployed container in the EU.

The second try was the GCSFuse driver. It does not work for non-privileged containers as of Kubernetes v1.0 and Google Container Engine. So, full stop.

My Dockerfile (based on Google's GCE Datalab image):

FROM debian:jessie

# Setup OS and core packages
RUN apt-get clean
RUN echo "deb-src http://ftp.be.debian.org/debian testing main" >> /etc/apt/sources.list && \
apt-get update -y && \
apt-get install --no-install-recommends -y -q \
    curl wget unzip git vim build-essential ca-certificates pkg-config \
    libatlas-base-dev liblapack-dev gfortran \
    libpng-dev libfreetype6-dev libxft-dev \
    libxml2-dev \
    python2.7 python-dev python-pip python-setuptools python-zmq && \
mkdir -p /tools && \
mkdir -p /srcs && \
cd /srcs && apt-get source -d python-zmq && cd

WORKDIR /datalab


# Setup Google Cloud SDK
RUN apt-get install --no-install-recommends -y -q wget unzip git
RUN wget -nv https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.zip && \
unzip -qq google-cloud-sdk.zip -d tools && \
rm google-cloud-sdk.zip && \
tools/google-cloud-sdk/install.sh --usage-reporting=false \
    --path-update=false --bash-completion=false \
    --disable-installation-options && \
tools/google-cloud-sdk/bin/gcloud config set --scope=installation \
    component_manager/fixed_sdk_version 0.9.57 && \
tools/google-cloud-sdk/bin/gcloud -q components update \
    gcloud core bq gsutil compute preview alpha beta && \
rm -rf /root/.config/gcloud

# Install FUSE driver for GCE
RUN apt-get install -y lsb-release
RUN echo "deb http://packages.cloud.google.com/apt gcsfuse-jessie main" > /etc/apt/sources.list.d/gcsfuse.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update && apt-get install -y gcsfuse
RUN mkdir /datalab/mount

# Setup Python packages
RUN pip install -U \
    tornado==4.2.1 pyzmq==14.4.0 jinja2==2.7.3 \
    jsonschema==2.5.1 py-dateutil==2.2 pytz==2015.4 pandocfilters==1.2.4 pygments==2.0.2 \
    argparse==1.2.1 mock==1.2.0 requests==2.4.3 oauth2client==1.4.12 httplib2==0.9.2 \
    futures==3.0.3 && \
    pip install -U numpy==1.9.2 && \
    pip install -U pandas==0.16.2 && \
    pip install -U scikit-learn==0.16.1 && \
    pip install -U scipy==0.15.1 && \
    pip install -U sympy==0.7.6 && \
    pip install -U statsmodels==0.6.1 && \
    pip install -U matplotlib==1.4.3 && \
    pip install -U ggplot==0.6.5 && \
    pip install -U seaborn==0.6.0 && \
    pip install -U notebook==4.0.2 && \
    pip install -U PyYAML==3.11 && \
    easy_install pip && \
    find /usr/local/lib/python2.7 -type d -name tests | xargs rm -rf

# Path configuration
ENV PATH $PATH:/datalab/tools/google-cloud-sdk/bin
ENV PYTHONPATH /env/python

# IPython configuration
WORKDIR /datalab
RUN ipython profile create default
RUN jupyter notebook --generate-config
ADD ipython.py /root/.ipython/profile_default/ipython_config.py

# Install TensorFlow.
RUN wget -nv https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl && \
  pip install --upgrade tensorflow-0.7.1-cp27-none-linux_x86_64.whl && rm tensorflow-0.7.1-cp27-none-linux_x86_64.whl


# Add build artifacts
ADD build/lib/GCPData-0.1.0.tar.gz /datalab/lib/
ADD build/lib/GCPDataLab-0.1.0.tar.gz /datalab/lib/
ADD setup-repo.sh /datalab
ADD setup-env.sh /datalab
ADD run.sh /datalab
RUN chmod 755 /datalab/*

# Install build artifacts
RUN cd /datalab/lib/GCPData-0.1.0 && python setup.py install
RUN cd /datalab/lib/GCPDataLab-0.1.0 && python setup.py install

RUN mkdir /datalab/content
WORKDIR /datalab/content
EXPOSE 6006
EXPOSE 8123
# see https://github.com/ipython/ipython/issues/7062
CMD ["/datalab/run.sh"]

1 Answer


OK, I solved the problem:

  1. Use a post-save hook, as explained in a previous post
  2. Use several git commands in the hook, as explained in this blog

Here is the code from step 2, for the record. It goes into ipython.py:

import os
from subprocess import check_call
from shlex import split

...

def post_save(model, os_path, contents_manager):
    """post-save hook for doing a git commit / push"""
    if model['type'] != 'notebook':
        return  # only do this for notebooks
    workdir, filename = os.path.split(os_path)
    if filename.startswith('Scratch') or filename.startswith('Untitled'):
        return  # skip scratch and untitled notebooks
    # now do git add / git commit / git push
    # note: shlex.split breaks on filenames containing spaces
    check_call(split('git add {}'.format(filename)), cwd=workdir)
    check_call(split('git commit -m "notebook save" {}'.format(filename)), cwd=workdir)
    check_call(split('git push'), cwd=workdir)

# `c` is the config object provided inside the Jupyter/IPython config file
c.FileContentsManager.post_save_hook = post_save

My run.sh utilizes setup-env.sh and setup-repo.sh from Google Datalab and consequently depends on the gcloud commands and the Kubernetes deployment for credentials. If you deploy differently, make sure to extend your Dockerfile with credentials.

#!/bin/bash
cd /datalab/content
. /datalab/setup-env.sh
. /datalab/setup-repo.sh
if [ $? -ne 0 ]; then
    exit 1
fi
cd /datalab/content/master_branch  # multiple branches not planned here!
/usr/local/bin/jupyter notebook --ip='*' --no-browser --port=8123
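
For the Google Cloud Storage half of the question, the same post-save mechanism could shell out to gsutil (which the Dockerfile above already installs as part of the Cloud SDK) instead of git. A minimal, untested sketch; the bucket name and helper names are placeholders of mine:

```python
import os
from subprocess import check_call

BUCKET = 'my-notebook-bucket'  # placeholder; use your own bucket name


def gcs_copy_cmd(os_path, bucket):
    """Build the gsutil command that copies one notebook into the bucket."""
    target = 'gs://{}/{}'.format(bucket, os.path.basename(os_path))
    return ['gsutil', 'cp', os_path, target]


def post_save(model, os_path, contents_manager):
    """Post-save hook: mirror saved notebooks to Google Cloud Storage."""
    if model['type'] != 'notebook':
        return  # only do this for notebooks
    check_call(gcs_copy_cmd(os_path, BUCKET))
```

Registered the same way via `c.FileContentsManager.post_save_hook = post_save` in ipython.py; the container must be authenticated for gsutil, e.g. via the same setup-env.sh credentials.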
  • Btw: this solution is (a) faster and (b) more stable compared to the Node.js server which runs Datalab (kernel interrupts are possible, kernel restarts are reliable). Maybe Google can consider switching from Node.js to something more stable. – Frank Apr 22 '16 at 08:32