Apache Zeppelin: Save DataFrame from notebook into CSV to local drive via browser

Question

My Zeppelin installation (version 0.9.0-preview1) is on a server. I have a Spark DataFrame which I converted to a Pandas DataFrame, assuming that saving it would be an easy 'df.to_csv()'. I realize that when I display the DataFrame as a SQL table using %sql, there is an option for downloading a CSV at the top right of the Helium ribbon. That only gives the complete data if the number of rows is less than 'zeppelin.spark.maxResult'. I increased 'zeppelin.spark.maxResult' to 25,000, but that makes the browser slow and the application crashes on me, so I rolled it back to 10,000. We only need 10,000 rows to be plotted, but when needed we want to be able to download the entire DataFrame locally rather than a dataset truncated to 'zeppelin.spark.maxResult' rows.

After searching, I came across the following Python function from here:

import base64
import pandas as pd
from IPython.display import HTML

def create_download_link( df, title = "Download CSV file", filename = "data.csv"):
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

df = pd.DataFrame(data = [[1,2],[3,4]], columns=['Col 1', 'Col 2'])
create_download_link(df)

But I get <IPython.core.display.HTML object> in the result.

I even tried to tweak this code: instead of return HTML(html), I used display(HTML(html)) after looking here, which gave me the same outcome.

Another solution is described here for Jupyter notebooks. I tried part of the suggested code:

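def csv_download_link(df, csv_file_name):
    """Display a download link to load a data frame as csv from within a Jupyter notebook"""
    # Write the CSV to the server's file system (not the browser's machine)
    df.to_csv(csv_file_name, index=False)
    # Import display explicitly; it is only a builtin inside IPython/Jupyter
    from IPython.display import FileLink, display
    display(FileLink(csv_file_name))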

csv_download_link(df, 'df.csv')

This gave me just the path to where the CSV is saved on the server: /folder/folder/df.csv.

Now I have to figure out either how to get the data out of <IPython.core.display.HTML object> or how to create a URL that lets me download the file from the server at /folder/folder/df.csv. The way Zeppelin's routing system is set up, even if I save the file in the notebook folder within Zeppelin (where all notebooks reside), I still cannot access it using http://server.com/#/notebook/df.csv or http://server.com/notebook/df.csv, in spite of the CSV file residing in that directory. As far as I understand, this may be a security measure.

Any suggestions would be greatly appreciated.

In Zeppelin 0.9.0, the first solution works (if you use the `%python` interpreter). — Dacit, May 14 '21 at 20:41
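
For cases where the IPython display objects are not rendered, a possible variant of the first solution is to skip IPython entirely and print the link through Zeppelin's display system, which treats paragraph output beginning with %html as HTML. The sketch below is only an illustration under that assumption: the helper name csv_download_link_html and the sample CSV text are hypothetical, and it takes the CSV as a string so it runs without pandas. In the notebook, the string would come from df.to_csv(index=False).

```python
import base64
import html


def csv_download_link_html(csv_text, title="Download CSV file", filename="data.csv"):
    """Return an <a> tag that embeds csv_text as a base64 data URI.

    Hypothetical helper: mirrors the first solution in the question,
    but returns a plain string instead of an IPython HTML object.
    """
    # Base64-encode the CSV so it can be carried inside the href attribute
    payload = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
    return (
        '<a download="{filename}" href="data:text/csv;base64,{payload}" '
        'target="_blank">{title}</a>'
    ).format(
        filename=html.escape(filename, quote=True),
        payload=payload,
        title=html.escape(title),
    )


# Stand-in for df.to_csv(index=False) on the question's sample DataFrame
sample_csv = "Col 1,Col 2\n1,2\n3,4\n"
link_html = csv_download_link_html(sample_csv)

# Assumption: Zeppelin renders output that starts with "%html" as HTML
print("%html " + link_html)
```

Because the whole file travels inside the page as a data URI, this is likely to suit moderately sized DataFrames better than very large ones; how large a link the browser will accept is not something this sketch checks.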
