
I am using PySpark and I am having trouble writing to S3, though reading from S3 is not a problem.

This is my code:

import pandas as pd

dic = {
    'a': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 260, 'c4(%)': 4.79, 'c5': 78, 'c6': 352},
    'b': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 5, 'c4(%)': 0.09, 'c5': 2, 'c6': 280},
    'c': {'c1(%)': 0.0, 'c2': 0, 'c3($)': 0, 'c4(%)': 0.0, 'c5': 0, 'c6': 267},
}

df = pd.DataFrame(dic)

df.to_csv("s3://work/.../filename_2018-01-04_08:50:45.csv")

This is the error:

IOError: [Errno 2] No such file or directory: 's3://work/.../filename_2018-01-04_08:50:45.csv'

What is the problem?

HilaD
  • I guess the `to_csv` method of the DF will be looking to write to a location in your local filesystem and failing because there is no such location locally. You need to create a Spark DF rather than a pandas DF and then write to S3 – ags29 Jan 04 '18 at 10:12
  • @ags29 if I use a Spark DataFrame it writes Parquet, and I want a CSV file in S3. – HilaD Jan 04 '18 at 10:13
  • no, see below; you can use a `format` argument to save it as CSV – ags29 Jan 04 '18 at 10:16

1 Answer


See my comment above: you need to use a Spark DataFrame. One easy way to accomplish this is to turn the index of the pandas DF into a column and then convert it to a Spark DF:

df2 = sqlContext.createDataFrame(df.reset_index(drop=False))

Then use:

df2.write.save("s3://work/.../filename_2018-01-04_08:50:45.csv", format='csv', header=True)
ags29
  • I get this: TypeError: 'DataFrameWriter' object is not callable. – HilaD Jan 04 '18 at 10:19
  • sorry, my mistake, amended code above, typing ran ahead of my brain :) Try that and let me know if it works (may require a bit of tweaking as I do not have access to Spark right now to check, but it should basically be correct) – ags29 Jan 04 '18 at 10:21
  • also, from recall, the syntax will differ depending on which version of Spark you are using; let me know if that works for you – ags29 Jan 04 '18 at 10:26
  • It does not save as a single CSV file, only as a folder, like Parquet. – HilaD Jan 04 '18 at 10:29
  • OK, so you can write something like `df2.coalesce(1).write...` to coalesce to a single partition and then write (from recall, this will be a folder with a single file under it; see the first sketch after this thread). This is, however, not scalable, and I would ask if this is really what you want to do? – ags29 Jan 04 '18 at 10:42
  • alternatively, you can avoid PySpark altogether and save your original pandas DF to S3 using the `boto` library in Python (see the second sketch after this thread). – ags29 Jan 04 '18 at 10:44
  • It still gives me a folder, but with one file, so thank you. But if there is a way to simplify it, that would be best. – HilaD Jan 04 '18 at 10:46
  • I do not believe what you are asking for can be done with Spark; see https://stackoverflow.com/questions/43661660/spark-how-to-write-a-single-csv-file-without-folder – ags29 Jan 04 '18 at 10:53
  • so it is best to use boto – ags29 Jan 04 '18 at 10:55
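
For reference, a minimal sketch of the single-partition approach from the comments, assuming `df2` is the Spark DF created in the answer (the `...` in the path is kept elided, as in the question). Note that Spark still creates a directory at this path; it will just contain one part file:

# Coalesce to a single partition so only one part file is written.
# Spark cannot emit a bare file: the path below becomes a directory
# holding a single part-*.csv (plus a _SUCCESS marker).
df2.coalesce(1).write.save(
    "s3://work/.../filename_2018-01-04_08:50:45.csv",
    format='csv',
    header=True
)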
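
And a hedged sketch of the boto route from the comments. The comment names `boto`; this sketch uses its successor, boto3, and assumes AWS credentials are already configured. The bucket name 'work' comes from the question's path, and the `...` in the key stands for the elided prefix:

import io

import boto3
import pandas as pd

# df is the pandas DF from the question.
df = pd.DataFrame(dic)

# Serialize the DF to CSV in memory, then upload it directly,
# bypassing Spark entirely.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer)

s3 = boto3.client('s3')
s3.put_object(
    Bucket='work',
    Key='.../filename_2018-01-04_08:50:45.csv',  # '...' = your real key prefix
    Body=csv_buffer.getvalue(),
)

Unlike the Spark write, this produces a single S3 object (a real file) rather than a directory of part files.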