My Zeppelin installation (version 0.9.0-preview1) runs on a server. I have a Spark dataframe, which I converted to a Pandas dataframe assuming that exporting it would be as easy as 'df.to_csv()'. I realize that when I display the dataframe as a SQL table using %sql, there is an option to download a CSV at the top right of the Helium ribbon. That only works if the number of rows in the data is less than 'zeppelin.spark.maxResult'. I increased 'zeppelin.spark.maxResult' to 25,000, but that made the browser slow and the application crashed on me, so I rolled it back to 10,000. We only need 10,000 rows to be plotted, but when needed we want to be able to download the entire dataframe locally rather than a dataset truncated to 'zeppelin.spark.maxResult' rows.
After searching, I came across the following Python function here:
import base64
import pandas as pd
from IPython.display import HTML

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

df = pd.DataFrame(data=[[1, 2], [3, 4]], columns=['Col 1', 'Col 2'])
create_download_link(df)
But all I get in the result is <IPython.core.display.HTML object>.
I even tried tweaking this code, changing the return HTML(html) to display(HTML(html)) after looking here, which gave me the same outcome.
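For reference, the link itself is only a string, so the IPython wrapper may not be needed at all. Below is a minimal sketch, under the assumption that Zeppelin's display system renders paragraph output beginning with %html as HTML. This is not confirmed on 0.9.0-preview1. The helper name csv_download_anchor and the sample csv_text are hypothetical and used only for illustration; csv_text stands in for the string returned by df.to_csv(index=False).

```python
import base64
import html


def csv_download_anchor(csv_text, title="Download CSV file", filename="data.csv"):
    """Return an <a> tag whose href is a base64 data URI holding csv_text."""
    # Encode the CSV text so it can be embedded directly in the href attribute.
    payload = base64.b64encode(csv_text.encode("utf-8")).decode("ascii")
    return '<a download="{}" href="data:text/csv;base64,{}" target="_blank">{}</a>'.format(
        html.escape(filename, quote=True), payload, html.escape(title)
    )


# Hypothetical stand-in for df.to_csv(index=False) on the Pandas dataframe.
csv_text = "Col 1,Col 2\n1,2\n3,4\n"

# Assumption: the Zeppelin paragraph treats output starting with %html as HTML.
print("%html " + csv_download_anchor(csv_text))
```

Note that a data URI embeds the whole CSV in the page, so a very large dataframe may still strain the browser.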
I also tried part of another solution, described here for Jupyter notebooks:
from IPython.display import FileLink, display

def csv_download_link(df, csv_file_name):
    """Display a download link to load a data frame as csv from within a Jupyter notebook"""
    df.to_csv(csv_file_name, index=False)
    display(FileLink(csv_file_name))

csv_download_link(df, 'df.csv')
This gave me just the path to where the CSV is saved on the server: /folder/folder/df.csv.
Now I have to figure out either how to get the data out of <IPython.core.display.HTML object>, or how to create a URL that lets me download the file from the server at /folder/folder/df.csv. The way Zeppelin's routing system is set up, even if I save the file in the notebook folder within Zeppelin (where all notebooks reside), I still cannot access it using http://server.com/#/notebook/df.csv or http://server.com/notebook/df.csv, in spite of the CSV file residing in that directory. As far as I understand, this may be a security measure.
Any suggestions would be greatly appreciated.