Google Dataproc with Jupyter - Downloading files generated by notebook

Question

We're using Google Cloud Dataproc for quick data analysis, and we use Jupyter notebooks a lot. A common case for us is to generate a report that we then want to download as a CSV file.

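In a local Jupyter environment this is possible using FileLink, for example:

from IPython.display import FileLink
df.to_csv(path)   # write the report to a local file
FileLink(path)    # render a clickable download link (FileLinks is for directories)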

This doesn't work with Dataproc because the notebooks are kept in a Google Cloud Storage bucket and the generated links are relative to that prefix, for example: http://my-cluster-m:8123/notebooks/my-notebooks-bucket/notebooks/my_csv.csv

Does anyone know how to overcome this? Of course we can scp the file from the machine, but we're looking for something more convenient.

score 1 · Answer 1 · answered Jan 13 '19 at 22:40


To share a report, you can save it to Google Cloud Storage (GCS) instead of a local file.

To do so, convert your Pandas DataFrame to a Spark DataFrame and write it to GCS:

from pyspark import SparkContext
from pyspark.sql import SQLContext
sparkDf = SQLContext(SparkContext.getOrCreate()).createDataFrame(df)
sparkDf.write.csv("gs://<BUCKET>/<path>")
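
If the Spark conversion is unwanted, the same idea (put the report in GCS rather than on the master's local disk) can be sketched with the `google-cloud-storage` client library. This is only a sketch under assumptions: it assumes that library is installed on the cluster and that the cluster's service account can write to the bucket. The helper name `upload_report` and the bucket and object names in the usage comments are hypothetical.

```python
def upload_report(df, bucket, blob_name):
    """Serialize a pandas DataFrame to CSV and upload it as one GCS object.

    `bucket` is expected to be a google.cloud.storage Bucket (or any object
    exposing .name and .blob(name)); `upload_report` itself is a hypothetical
    helper, not part of any library.
    """
    csv_text = df.to_csv(index=False)      # build the CSV in memory
    blob = bucket.blob(blob_name)          # handle to the target object
    blob.upload_from_string(csv_text, content_type="text/csv")
    return "gs://{}/{}".format(bucket.name, blob_name)


# Usage inside a notebook (assumed names, shown as comments so the block
# does not require GCS credentials to load):
#
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-reports-bucket")
#   print(upload_report(df, bucket, "reports/my_csv.csv"))
```

Unlike `sparkDf.write.csv`, which writes a directory of part files, this writes a single object, which can then be downloaded from the Cloud Console or with `gsutil cp`.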

answered Jan 13 '19 at 22:40

Igor Dvorzhak



Thanks, this will work, but it's not what I'm looking for. It's actually better to use Dask for this. But I was looking for something more convenient, without conversions. – Avision Jan 14 '19 at 06:40
In this case, you may want to try overriding the URL prefix by specifying the `url_prefix` and/or `result_html_prefix` parameters in the [`FileLinks`](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html?#IPython.display.FileLinks) call. – Igor Dvorzhak Jan 14 '19 at 06:58
I've tried toying with it a bit but couldn't find a way to fix it. Do you know what `url_prefix` or `result_html_prefix` I should use? – Avision Jan 15 '19 at 12:56
I would guess that you need to specify the master hostname as the prefix, but either way you will need to make it reachable: by opening up firewall rules to the internet, which is insecure, or by SSHing into the network, which is not convenient. That's why the best option would be to use GCS to share your reports. – Igor Dvorzhak Jan 16 '19 at 01:44
