I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why?

Connect to Database

#SQL Connector
import pandas as pd
import psycopg2
import numpy as np
from pyspark.sql import *

#Connection
cnx = psycopg2.connect(dbname='test', host='test', port='1234', user='test', password='test')
cursor = cnx.cursor()

SQL Query

query = """
SELECT * from products;  
"""

# Execute the query
try:
  cursor.execute(query)
except psycopg2.OperationalError as msg:
  print("Command skipped:", msg)

#Fetch all rows from the result
rows = cursor.fetchall()

# Convert into a Pandas Dataframe
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
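(Editor's aside, not part of the original question: the nested comprehension is equivalent to passing `rows` straight to `pd.DataFrame`, and `cursor.description` supplies real column names. Sketched here with the stdlib's `sqlite3` standing in for `psycopg2`, since both follow the Python DB-API; the `products` table is a made-up example.)

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for psycopg2 here; both follow the DB-API,
# so the cursor/fetchall/description pattern is the same.
cnx = sqlite3.connect(":memory:")
cursor = cnx.cursor()
cursor.execute("CREATE TABLE products (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO products VALUES (1, 'widget'), (2, 'gadget')")
cursor.execute("SELECT * FROM products")
rows = cursor.fetchall()

# Equivalent to the nested comprehension, plus real column names
df = pd.DataFrame(rows, columns=[d[0] for d in cursor.description])
print(df.columns.tolist())  # ['id', 'name']
```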

Exporting Data as CSV to Local Machine

df.to_csv('test.csv')

It does NOT give any error, but when I search for "test.csv" on my Mac, it does not exist. I presume the operation did not work and the file was never saved from the Databricks cloud server to my local machine... Does anybody know how to fix this?

LaLaTi
    It seems like you're running all of this on a remote machine- if so, the file is saved on that remote machine as well. There's not going to be any (easy) python code to export the data to your local machine. The easiest thing would likely be to use some file transfer tool. – pault Aug 20 '18 at 15:36
  • Within Databricks, where did the file go? Maybe I can just manually grab it from there. Please advise. – LaLaTi Aug 20 '18 at 15:50
  • You should start checking from the directory where this code is in (to be more exact, the working directory). – fractals Aug 20 '18 at 15:52
  • When I run os.system('pwd'), it gives me Out: 0. So it is not working... – LaLaTi Aug 20 '18 at 15:59
  • @pault when I do this command, there is no output. – LaLaTi Aug 20 '18 at 16:03
  • @Tina try the options outlined [here](https://stackoverflow.com/questions/5137497/find-current-directory-and-files-directory). – pault Aug 20 '18 at 16:16

3 Answers

Since you are using Databricks, you are most probably working on a remote machine. As was already mentioned, saving the way you do won't work: the file will be saved on the machine your notebook's driver node runs on. Try running:

import os

os.listdir(os.getcwd())

This will list all the files in the directory the notebook is running from (at least, that is how Jupyter notebooks work). You should see the saved file there.
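The underlying point, that a relative path passed to to_csv resolves against the interpreter's working directory, wherever that process runs, can be demonstrated locally; a minimal sketch (the temp directory just plays the role of the remote driver's working directory):

```python
import os
import tempfile
import pandas as pd

# Pretend this temp directory is the notebook's working directory
# on the remote driver node.
workdir = tempfile.mkdtemp()
os.chdir(workdir)

# A relative path resolves against os.getcwd() -- on Databricks,
# that is a folder on the driver node, not your laptop.
pd.DataFrame({"a": [1, 2]}).to_csv("test.csv")

print(os.path.exists(os.path.join(workdir, "test.csv")))  # True
```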

However, I would expect Databricks to provide utility functions to its clients for easy data download from the cloud. Also, try using Spark to connect to the database; it might be a little more convenient.

I think these two links should be useful for you:

Similar question on databricks forums

Databricks documentation

Vaidas Armonas
  • Within Databricks, where did the file go? Maybe I can just manually grab it from there. Please advise. – LaLaTi Aug 20 '18 at 15:45
  • How can I find this file on a remote server when it was not saved in my personal work space OR how can I save this file in my workspace directly? – LaLaTi Aug 20 '18 at 15:57

Because you're running this in a Databricks notebook, when you use Pandas to save your file as test.csv, it is saved to the Databricks driver node's file system. A way to test this out is the following code snippet:

# Within Databricks, there are sample files ready to use within 
# the /databricks-datasets folder    
df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", inferSchema=True, header=True)

# Converting the Spark DataFrame to a Pandas DataFrame
import pandas as pd
pdDF = df.toPandas()

# Save the Pandas DataFrame to disk
pdDF.to_csv('test.csv')

The location of your test.csv is within the /databricks/driver/ folder of your Databricks cluster's driver node. To validate this:

# Run the following shell command to see the results
%sh cat test.csv

# The output directory is shown here
%sh pwd

# Output
# /databricks/driver

To save the file to your local machine (i.e. your Mac), you can view the Spark DataFrame using the display command within your Databricks notebook. From there, you can click on the "Download CSV" button, highlighted in red in the image below. [Image: displayed DataFrame with the Download CSV button highlighted]

Denny Lee

Select from SQL Server:

import pypyodbc 
cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
                        "Server=Server_Name;"
                        "Database=TestDB;"
                        "Trusted_Connection=yes;")

cursor = cnxn.cursor()
cursor.execute('SELECT * FROM Actions')

for row in cursor:
    print('row = %r' % (row,))

From SQL Server to Excel:

import pyodbc
import pandas as pd

# cnxn = pyodbc.connect("Driver={SQL Server};SERVER=xxx;Database=xxx;UID=xxx;PWD=xxx")
cnxn = pyodbc.connect(r"Driver={SQL Server};SERVER=EXCEL-PC\SQLEXPRESS;Database=NORTHWND;")
data = pd.read_sql('SELECT * FROM Orders',cnxn)

data.to_excel('C:\\your_path_here\\foo.xlsx')
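(Editor's sketch: the pd.read_sql pattern above works with any DB-API connection, not just pyodbc. Since the SQL Server instance above isn't reproducible, the stdlib's sqlite3 stands in here, and the `Orders` rows are made up for illustration.)

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for the pyodbc connection above; pd.read_sql
# accepts a sqlite3 DB-API connection directly.
cnxn = sqlite3.connect(":memory:")
cnxn.execute("CREATE TABLE Orders (id INTEGER, item TEXT)")
cnxn.execute("INSERT INTO Orders VALUES (1, 'chai'), (2, 'tofu')")

data = pd.read_sql("SELECT * FROM Orders", cnxn)
print(len(data))  # 2

# to_excel requires openpyxl to be installed; to_csv has no extra dependency
data.to_csv("orders.csv", index=False)
```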
ASH