
I am trying to load data from BigQuery into a Jupyter Notebook, where I will do some manipulation and plotting. The dataset is 25 million rows with 10 columns, which definitely exceeds my machine's memory capacity (16 GB).

I have read this post about using HDFStore, but the problem is that I still need to read the data into the Jupyter Notebook to do the manipulation.

I am using Google Cloud Platform, so setting up a huge cluster in Dataproc might be an option, though that could be costly.

Has anyone run into a similar issue and found a solution?


1 Answer

Concerning products within Google Cloud Platform, you can create a Datalab instance to run your notebooks and specify the desired machine type with the --machine-type flag (docs). You can use a high-memory machine if needed.
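
For example, a high-memory Datalab instance could be created along these lines (the instance name, zone and machine type below are placeholders to adjust to your project):

# "my-datalab", the zone and the machine type are example values
datalab create my-datalab --zone us-central1-a --machine-type n1-highmem-8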

Of course, you can also use Dataproc as you already proposed. For an easier setup you can use the predefined initialization action by passing the following parameter upon cluster creation:

--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh 
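
For reference, a full cluster creation command might look like the sketch below (the cluster name, region and machine type are placeholder values):

# Cluster name, region and machine type are example values
gcloud dataproc clusters create my-cluster \
  --region us-central1 \
  --master-machine-type n1-highmem-8 \
  --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh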

Edit

As you are using a GCE instance, you can also use a script to auto-shutdown the VM when you are not using it. You can edit ~/.bash_logout so that it checks if it's the last session and, if so, stops the VM:

# Stop this VM when the last interactive session logs out
if [ "$(who | wc -l)" -eq 1 ];
then
  # The metadata server returns projects/PROJECT_NUMBER/zones/ZONE; keep only the zone name
  gcloud compute instances stop "$(hostname)" --zone "$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>/dev/null | cut -d/ -f4)" --quiet
fi

Or, if you prefer a curl approach:

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -d "" \
  https://www.googleapis.com/compute/v1/projects/$(gcloud config get-value project 2>/dev/null)/zones/$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/zone 2>/dev/null | cut -d/ -f4)/instances/$(hostname)/stop

Keep in mind that you might need to update Cloud SDK components to get the gcloud command to work. Either use:

gcloud components update

or

sudo apt-get update && sudo apt-get --only-upgrade install kubectl google-cloud-sdk google-cloud-sdk-datastore-emulator google-cloud-sdk-pubsub-emulator google-cloud-sdk-app-engine-go google-cloud-sdk-app-engine-java google-cloud-sdk-app-engine-python google-cloud-sdk-cbt google-cloud-sdk-bigtable-emulator google-cloud-sdk-datalab -y

You can include one of these commands, together with the ~/.bash_logout edit, in your startup-script.
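
As a sketch, assuming the startup script is saved locally as startup.sh (a hypothetical file name), you could attach it to an existing instance with:

# "my-instance" and startup.sh are example values
gcloud compute instances add-metadata my-instance \
  --metadata-from-file startup-script=startup.sh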

  • Finally I set up a high-memory VM in Compute Engine (52 GB) and built the Jupyter notebook on top of that. It took me around 2 hours. The cost is not cheap, around $200 per month. – Frank Feb 26 '18 at 21:39
  • Take into account that you can save costs by stopping the VM when you are not using it (you won't benefit from sustained discount then, though). Datalab has an [auto shutdown feature](https://cloud.google.com/datalab/docs/concepts/auto-shutdown). Also, you can use [preemptibles](https://cloud.google.com/compute/docs/instances/preemptible) with a startup script if you recreate the instance each time. – Guillem Xercavins Feb 27 '18 at 08:14
  • Just added a way to configure auto shutdown for GCE instances, too. Hope that helps. – Guillem Xercavins Mar 01 '18 at 09:25
  • Thanks for this information. I will try this out and let you know whether that works. – Frank Mar 02 '18 at 23:11