
I want to schedule the code in a Datalab (from Google Cloud Platform, or GCP) ipynb file to run regularly (say, once per day). Please help shed some light on possible workarounds. Thanks!

  • I think GCP Cloud Scheduler or cron jobs (cron.yaml) should do it, but I'm not sure how they work exactly.

Here is the code in my ipynb (which works okay):

import requests
from bs4 import BeautifulSoup
import io

# Fetch the draw-history page and the site's stylesheet.
url = "https://www.taiwanlottery.com.tw/lotto/superlotto638/history.aspx"
soup = BeautifulSoup(requests.get(url).content,
                     features="html.parser", from_encoding='utf-8')

css_url = "https://www.taiwanlottery.com.tw/css1.css"
soup_css = BeautifulSoup(requests.get(css_url).content,
                         features="html.parser", from_encoding='utf-8')


# Pull out the results table and save it (plus the CSS) as a standalone HTML file.
# unicode() works here because the notebook runs under Datalab's py2env (Python 2.7).
table = soup.find("table", id="SuperLotto638Control_history1_dlQuery")

with io.open("superLottery.html", "w", encoding='utf-16') as f:
    f.write(unicode(table))
    f.write(unicode('<style type="text/css">'))
    f.write(unicode(soup_css))
    f.write(unicode("</style>"))

!gsutil cp 'superLottery.html' 'gs://astral-petal-222508.appspot.com/datalab-backups/asia-east1-a/new-env'
Fang Yeh

1 Answer


I believe that scheduling a periodic run of an IPython notebook from Cloud Datalab is enough of an anti-pattern that it shouldn't be encouraged.

The Jupyter "server" runs inside a container on a Compute Engine VM instance.
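
You can see this for yourself by SSHing into the instance and listing the running containers (using the project/zone/instance variables defined further down); a container named datalab should show up:

$ gcloud compute ssh $instance --project $project --zone $zone -- docker ps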

At first thought, one could hope to achieve this by converting the notebook into a regular Python module and then running it remotely; the problem is the third-party library dependencies you might have in place.

Even if there are no dependencies, the software required to convert the notebook isn't installed in the container's image, so you'd have to install it yourself, and re-install it after every instance restart.
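
For reference, the usual conversion route would be jupyter's nbconvert, which (assuming it were available in the image) turns a notebook into a plain .py script:

$ jupyter nbconvert --to script /content/datalab/notebooks/notebook0.ipynb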

You could also convert it "yourself". This isn't guaranteed to work in every case without deeper research into the notebook format (which, at first glance, doesn't look too complicated), but I'll demonstrate how to do it below.
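
To follow the snippet below, it helps to know that an .ipynb file is just JSON. Trimmed to the only fields we actually read, a notebook looks roughly like this (real files carry more metadata per cell):

{
  "cells": [
    {
      "cell_type": "code",
      "source": ["import requests\n", "print('hello')"]
    },
    {
      "cell_type": "markdown",
      "source": ["Some notes.\n"]
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 2
}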

So, let's extract the notebook's source code and pass it to the exec built-in function, loading the dependencies first so that the exec call runs successfully.

All of this remotely, through the datalab container running on the VM instance.

$ project= #TODO edit
$ zone= #TODO edit
$ instance= #TODO edit

$ gcloud compute ssh $instance --project $project --zone $zone -- docker exec datalab python '-c """
import json
import imp

#TODO find a better way and not escape quote characters?...

bs4 = imp.load_package(\"bs4\", \"/usr/local/envs/py2env/lib/python2.7/site-packages/bs4\")
BeautifulSoup = bs4.BeautifulSoup

notebook=\"/content/datalab/notebooks/notebook0.ipynb\"

#Skip lines that plain exec can't handle: the bs4 import (replaced by imp.load_package above)
#and IPython shell escapes such as the \"!gsutil ...\" copy, which would need to run separately.
source_exclude = (\"from bs4 import BeautifulSoup\",)

with open(notebook) as fp:
    source = \"\n\".join(line for cell in json.load(fp)[\"cells\"] if cell[\"cell_type\"] == \"code\" for line in cell[\"source\"] if line.strip() not in source_exclude and not line.lstrip().startswith(\"!\"))


#print(source)

exec(source)
"""
'

So far, I couldn't find another way that avoids escaping the quote characters, as my bash expertise is limited.
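
A possible way around the escaping altogether (an untested sketch; run_notebook.py is a hypothetical local file holding the same driver code as above, unescaped) would be to feed the script to the container's Python over stdin instead of -c:

$ gcloud compute ssh $instance --project $project --zone $zone -- docker exec -i datalab python < run_notebook.py

The -i flag keeps stdin open for docker exec, and python with no arguments reads the program from stdin, so no quote escaping is needed.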

You'll also, at the very least, get warnings about some of the imp.load_package library's dependencies not being available. This reminds us that this approach doesn't scale at all.

I don't know what you think of this, but it is probably better to put the Python source code you want to run in a Cloud Function and then trigger that function with Cloud Scheduler. Check out this community example for that.
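
As a rough illustration of that route, here is a minimal sketch (not tested; the function name scrape_lottery, the HTTP trigger, and the use of the google-cloud-storage client in place of gsutil are my assumptions, and requests, beautifulsoup4 and google-cloud-storage would have to be listed in requirements.txt):

# main.py - hypothetical HTTP-triggered Cloud Function (Python 3 runtime)
import requests
from bs4 import BeautifulSoup
from google.cloud import storage


def scrape_lottery(request):
    """Scrape the Super Lotto history table and upload it to Cloud Storage."""
    url = "https://www.taiwanlottery.com.tw/lotto/superlotto638/history.aspx"
    css_url = "https://www.taiwanlottery.com.tw/css1.css"

    soup = BeautifulSoup(requests.get(url).content, features="html.parser")
    soup_css = BeautifulSoup(requests.get(css_url).content, features="html.parser")

    table = soup.find("table", id="SuperLotto638Control_history1_dlQuery")
    html = '{}<style type="text/css">{}</style>'.format(table, soup_css)

    # Upload the generated page to the same bucket/path the notebook used.
    client = storage.Client()
    bucket = client.bucket("astral-petal-222508.appspot.com")
    blob = bucket.blob("datalab-backups/asia-east1-a/new-env/superLottery.html")
    blob.upload_from_string(html, content_type="text/html")

    return "ok"

A Cloud Scheduler job with a daily cron schedule (for example 0 8 * * *) pointing at the function's HTTP trigger URL would then take care of the "once per day" part.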

I believe a reasonable takeaway from this post is that a notebook has different use cases than a Python module.

Also, make sure to go through the Cloud Datalab documentation to at least understand some of the concepts this answer relates to.

fbraga