
I'm collecting metrics while running a PySpark job on Dataproc, and I'm unable to persist them to Google Storage (using plain Python functions, not Spark).

The point is that I can save them, and during the execution I can read and modify them successfully, but when the job ends there's nothing in my Google Storage folder.

Is it possible to persist Python objects, or is this only possible using PySpark libraries?

Edit: I've added a code snippet to clarify the question.

# Python
import pandas as pd

# Pyspark
from pyspark.sql import SparkSession

# Google storage filepath
filepath = 'gs://[PATH]/'

spark_session = SparkSession.builder.getOrCreate()

sdf = spark_session.createDataFrame([[1],[2],[3],[4],[5]], ['col'])
pdf = pd.DataFrame([1,2,3,4,5], columns=['col'])

# Save the pandas dataframe (THIS DOES NOT END UP IN MY BUCKET)
pdf.to_pickle(filepath + 'pickle.pkl')

# Save the spark dataframe (THIS DOES END UP IN MY BUCKET)
sdf.write.csv(filepath + 'spark_dataframe.csv')

# Read the pickle back (THIS WORKS, BUT ONLY DURING THIS JOB EXECUTION;
# IT'S NOT ACCESSIBLE BY ME AFTERWARDS, maybe it's in some temporary folder only)
df_read = pd.read_pickle(filepath + 'pickle.pkl')
Luis A.G.
  • "unable to persist *them* in google storage" - if you're referring to Python objects, you can [pickle](https://stackoverflow.com/questions/4529815/saving-an-object-data-persistence) them. – sshine Feb 08 '18 at 15:03
  • I can pickle them during the execution, and I've checked that, because in other parts of the execution I read and modify that pickle. The problem is that the pickle is not saved in the Google Storage location that I specify, because it's not in my bucket. – Luis A.G. Feb 08 '18 at 16:17
  • This is expected behavior. `to_pickle` writes to local drive storage, which will be discarded later. If you want it preserved, you have to move it to some permanent storage yourself. – zero323 Feb 08 '18 at 22:17
  • I see, and do you know if there's any pre-installed library in Dataproc that allows me to do that? It's to avoid spending more time setting up Dataproc clusters. – Luis A.G. Feb 09 '18 at 09:48
  • 1
    Have you considered using `to_pickle` to save them to local file and running Hadoop's [DistCp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) to synchronize them with a GCS Bucket? You can use the [subprocess](http://www.pythonforbeginners.com/os/subprocess-for-system-administrators) library from Pyspark for shell commands – Guillem Xercavins Feb 15 '18 at 16:59
  • I would like to save it directly to Google Storage from Dataproc, because Google charges for operations outside GCP. For synchronizing local files I use the google.cloud storage library. – Luis A.G. Feb 19 '18 at 07:21
  • 1
    You can use `DistCp` directly from Dataproc. Alternatively, you could use `gsutil` also with [this approach](https://stackoverflow.com/questions/39945687/downloading-files-from-google-storage-using-spark-python-and-dataproc), which is similar to mine in using the `subprocess` library. – Guillem Xercavins Feb 19 '18 at 08:17
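
For reference, the google.cloud storage approach mentioned in the comments above could look roughly like this (a minimal sketch, assuming the google-cloud-storage client library is available on the cluster; BUCKET_NAME is a placeholder):

# Python
import pandas as pd
from google.cloud import storage

pdf = pd.DataFrame([1, 2, 3, 4, 5], columns=['col'])

# Write the pickle to the local filesystem first
pdf.to_pickle('pickle.pkl')

# Then upload the local file to the bucket with the GCS client
client = storage.Client()
bucket = client.bucket('BUCKET_NAME')  # placeholder bucket name
blob = bucket.blob('pickle/pickle.pkl')
blob.upload_from_filename('pickle.pkl')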

1 Answer


Elaborating on my previous comments, I modified your example to copy the pickle objects to GCS:

# Python
import pandas as pd
from subprocess import call
from os.path import join

# Pyspark
from pyspark.sql import SparkSession

# Google storage filepath
filepath = 'gs://BUCKET_NAME/pickle/'
filename = 'pickle.pkl'

spark_session = SparkSession.builder.getOrCreate()

sdf = spark_session.createDataFrame([[1],[2],[3],[4],[5]], ['col'])
pdf = pd.DataFrame([1,2,3,4,5], columns=['col'])

# Save the pandas dataframe locally first
pdf.to_pickle('./gsutil/' + filename)
pdf.to_pickle('./distcp/' + filename)

# Option 1: copy the local file to the bucket with gsutil
call(["gsutil", "-m", "cp", join('./gsutil', filename), join(filepath, filename)])

# Option 2: stage the local file in HDFS, then run DistCp to the bucket
call(["hadoop", "fs", "-put", "./distcp/", "/user/test/"])
call(["hadoop", "distcp", "/user/test/distcp/" + filename, join(filepath, "distcp/" + filename)])

Also, be sure to create the necessary folders (local and HDFS) and replace BUCKET_NAME with your actual bucket name beforehand for the example to work.
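
For completeness, those setup steps could be scripted the same way (a minimal sketch using the same placeholder paths as above):

from subprocess import call

# Create the local staging folders used by the example
call(["mkdir", "-p", "./gsutil", "./distcp"])

# Create the HDFS staging folder used by the DistCp step
call(["hadoop", "fs", "-mkdir", "-p", "/user/test"])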

Guillem Xercavins