
I have been working on a Spark cluster using the Google Cloud Dataproc service for machine learning modelling. I have successfully loaded the data from a Google Storage bucket. However, I am not sure how to write a pandas DataFrame or a Spark DataFrame to the Cloud Storage bucket as a CSV.

When I use the command below, it gives me an error:

df.to_csv("gs://mybucket/")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 156, in save
compression=self.compression)
File "/opt/conda/lib/python3.6/site-packages/pandas/io/common.py", line 400, in _get_handle
f = open(path_or_buf, mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: 'gs://mybucket/'

However, the following command works, though I am not sure where it is saving the file:

df.to_csv("data.csv")

I also followed the article "Write a Pandas DataFrame to Google Cloud Storage or BigQuery", and it gives the following error:

import google.datalab.storage as storage
ModuleNotFoundError: No module named 'google.datalab'

I am relatively new to Google Cloud Dataproc and Spark, and I was hoping someone could help me understand how I can save my output pandas DataFrame to a Google Cloud Storage bucket.

Thanks in advance!

######## For Igor, as requested
from pyspark.ml.classification import RandomForestClassifier as RF

# Train a random forest and score the held-out test data
rf = RF(labelCol='label', featuresCol='features', numTrees=200)
fit = rf.fit(trainingData)
transformed = fit.transform(testData)

from pyspark.mllib.evaluation import BinaryClassificationMetrics as metric
# Keep only the predicted probability vector and the true label
results = transformed.select(['probability', 'label'])
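
As posted, the BinaryClassificationMetrics import is never used; a minimal sketch of how it could be applied, assuming the second element of each probability vector is the positive-class score:

# Hypothetical usage of the imported metric class: BinaryClassificationMetrics
# expects an RDD of (score, label) pairs.
score_and_labels = results.rdd.map(
    lambda row: (float(row['probability'][1]), float(row['label'])))
print(metric(score_and_labels).areaUnderROC)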


# Decile creation for the output
test = results.toPandas()
# Split the probability vector into per-class columns
test['X0'] = test.probability.str[0]
test['X1'] = test.probability.str[1]
test = test.drop(columns=['probability'])
# Order by positive-class probability, highest first
test = test.sort_values(by='X1', ascending=False).reset_index(drop=True)
test['rownum'] = test.index
x = round(test['rownum'].count() / 10)  # rows per decile
test['rank'] = test.rownum // x + 1     # deciles 1..10 (rownum is 0-based)
Tushar Mehta
  • You may want to try writing the CSV to a file in the bucket, not to the bucket itself: `df.to_csv("gs://mybucket/data.csv")`. If that does not work, make sure that the `mybucket` bucket exists: `gsutil mb gs://mybucket/` – Igor Dvorzhak Nov 05 '18 at 06:21
  • Hello Igor, the bucket does exist. I loaded the raw data from the same location and was able to load it using the command `raw_data = (spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("gs://mybucket/new_results.csv"))`. However, when I try to save the file using `df.to_csv("gs://mybucket/data.csv")`, it does not work. I have updated the question above with the complete error – Tushar Mehta Nov 05 '18 at 21:20
  • You can write a Spark DataFrame with `df.write.csv("gs://mybucket/data.csv")`. Could you provide a repro of how you are using Pandas, so it will be easier to help you with writing the Pandas DataFrame to GCS? – Igor Dvorzhak Nov 05 '18 at 22:11
  • @Igor, I have updated the main question with the entire code of what I am doing. I trained my model and passed the test data through it to get the predictions. I created a new data frame, which I believe is a Spark data frame, with the selected columns. I then converted it to a pandas data frame and did my operations to create the rank. Now I am trying to save this pandas data frame to a Google Cloud bucket – Tushar Mehta Nov 05 '18 at 22:50
  • Hello Igor, following your suggestion above to write a Spark data frame, I converted my pandas data frame to a Spark data frame and was able to write it to the Google Cloud Storage bucket. However, I am still intrigued as to why I was not able to write the pandas data frame directly to the bucket without converting it to a Spark data frame – Tushar Mehta Nov 05 '18 at 22:56

1 Answer


The easiest approach is to convert the Pandas DataFrame to a Spark DataFrame and write it to GCS.

Here are instructions on how to do this: https://stackoverflow.com/a/45495969/3227693
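A minimal sketch of that approach, assuming an active SparkSession named spark, the pandas DataFrame test from the question, and a placeholder bucket name:

# Convert the pandas DataFrame back to a Spark DataFrame, then let the
# Hadoop GCS connector handle the write. "mybucket" is a placeholder.
spark_df = spark.createDataFrame(test)
spark_df.coalesce(1) \
    .write \
    .option("header", "true") \
    .csv("gs://mybucket/decile_output")

Note that Spark writes a directory of part files at the given path rather than a single CSV file; coalesce(1) keeps the output to one part file.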

Igor Dvorzhak
  • Thanks Igor, this did solve the issue; however, I was wondering why I was not able to write the pandas data frame directly – Tushar Mehta Nov 07 '18 at 02:05
  • Pandas cannot write to GCS because there is no Pandas-specific GCS connector/library that can write to GCS. Spark, in contrast to Pandas, integrates with Hadoop, so it can use the [GCS connector](https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs) for Hadoop, which is deployed on all Dataproc clusters by default, to write to GCS. – Igor Dvorzhak Nov 07 '18 at 15:03 (a pandas-side alternative is sketched after these comments)
  • Thanks Igor for the explanation. I understand that now, and thanks for all the help – Tushar Mehta Nov 07 '18 at 20:28
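
As a side note to the comment above: newer pandas versions can resolve gs:// paths themselves when the third-party gcsfs package is installed, with no Spark round-trip needed. A minimal sketch, assuming gcsfs is available on the cluster and using a placeholder bucket name:

import pandas as pd

# Requires the gcsfs package (pip install gcsfs); pandas delegates the
# gs:// I/O to it. "mybucket" is a placeholder bucket name.
df = pd.DataFrame({'X1': [0.9, 0.4], 'label': [1, 0]})
df.to_csv("gs://mybucket/data.csv", index=False)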