
I am using PySpark and I am having trouble writing to S3, though reading from S3 is not a problem.

This is my code:

import pandas as pd

dic = {
    'a': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 260, 'c4(%)': 4.79, 'c5': 78, 'c6': 352},
    'b': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 5, 'c4(%)': 0.09, 'c5': 2, 'c6': 280},
    'c': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 0, 'c4(%)': 0.0, 'c5': 0, 'c6': 267},
}

df = pd.DataFrame(dic)

df.to_csv("s3://work/.../filename_2018-01-04_08:50:45.csv")

This is the error:

IOError: [Errno 2] No such file or directory: 's3://work/.../filename_2018-01-04_08:50:45.csv'

What is the problem?

HilaD
  • I guess the `to_csv` method of the DF will be looking to write to a location in your local filesystem and failing because there is no such location locally. You need to create a Spark DF rather than a pandas DF and then write to S3 – ags29 Jan 04 '18 at 10:12
  • @ags29 if I use a Spark DataFrame it writes Parquet, and I want a CSV file in S3. – HilaD Jan 04 '18 at 10:13
  • no, see below; you can use a `format` argument to save it as CSV – ags29 Jan 04 '18 at 10:16

1 Answer


See my comment above: you need to use a Spark DataFrame. One easy way to accomplish this is to turn the index of the pandas DF into a column and then convert it to a Spark DF:

df2 = sqlContext.createDataFrame(df.reset_index(drop=False))

Then use:

df2.write.save("s3://work/.../filename_2018-01-04_08:50:45.csv", format='csv', header=True)
ags29
  • I get this: TypeError: 'DataFrameWriter' object is not callable. – HilaD Jan 04 '18 at 10:19
  • sorry, my mistake, amended code above, typing ran ahead of my brain :) Try that and let me know if it works (may require a bit of tweaking as I do not have access to Spark right now to check, but it should basically be correct) – ags29 Jan 04 '18 at 10:21
  • also, from recall, the syntax will differ depending on which version of Spark you are using; let me know if that works for you – ags29 Jan 04 '18 at 10:26
  • It does not save as a single CSV file, only as a folder, like Parquet. – HilaD Jan 04 '18 at 10:29
  • OK, so you can write something like `df2.coalesce(1).write...` to coalesce to a single partition and then write (from recall, this will be a folder with a single file under it; see the first sketch after this thread). This is, however, not scalable, and I would ask if this is really what you want to do? – ags29 Jan 04 '18 at 10:42
  • alternatively, you can avoid PySpark altogether and save your original pandas DF to S3 using the `boto` library in Python (see the second sketch after this thread). – ags29 Jan 04 '18 at 10:44
  • It still gives me a folder, but with one file, so thank you. But if there is a way to simplify it, that would be best. – HilaD Jan 04 '18 at 10:46
  • I do not believe what you are asking for can be done with Spark; see https://stackoverflow.com/questions/43661660/spark-how-to-write-a-single-csv-file-without-folder – ags29 Jan 04 '18 at 10:53
  • so it is best to use boto – ags29 Jan 04 '18 at 10:55
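
For reference, a minimal sketch of the single-partition approach from the comments, assuming `df2` is the Spark DF created in the answer (the `...` in the path is kept elided, as in the question). Note that Spark still creates a directory at this path; it will just contain one part file:

# Coalesce to a single partition so only one part file is written.
# Spark cannot emit a bare file: the path below becomes a directory
# holding a single part-*.csv (plus a _SUCCESS marker).
df2.coalesce(1).write.save(
    "s3://work/.../filename_2018-01-04_08:50:45.csv",
    format='csv',
    header=True
)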
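
And a hedged sketch of the boto route from the comments. The comment names `boto`; this sketch uses its successor, boto3, and assumes AWS credentials are already configured. The bucket name 'work' comes from the question's path, and the `...` in the key stands for the elided prefix:

import io

import boto3
import pandas as pd

# df is the pandas DF from the question.
df = pd.DataFrame(dic)

# Serialize the DF to CSV in memory, then upload it directly,
# bypassing Spark entirely.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer)

s3 = boto3.client('s3')
s3.put_object(
    Bucket='work',
    Key='.../filename_2018-01-04_08:50:45.csv',  # '...' = your real key prefix
    Body=csv_buffer.getvalue(),
)

Unlike the Spark write, this produces a single S3 object (a real file) rather than a directory of part files.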