
I have a PySpark DataFrame that has commas in one of its fields. Sample data:

+--------+------------------------------------------------------------------------------------+
|id      |reason                                                                              |
+--------+------------------------------------------------------------------------------------+
|123-8aab|Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!      |
|125-5afs|Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.|
+--------+------------------------------------------------------------------------------------+
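
For context, the sample data above can be recreated with something like the following; the DataFrame name df_csv and the SparkSession setup are assumptions, chosen to match the write snippet below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quote-demo").getOrCreate()

# Hypothetical reconstruction of the sample rows shown above
df_csv = spark.createDataFrame(
    [
        ("123-8aab", 'Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!'),
        ("125-5afs", 'Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.'),
    ],
    ["id", "reason"],
)
df_csv.show(truncate=False)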

When I write this out as CSV, the data spills over into the next column and is not represented correctly. The code I am using to write the data:

df_csv.repartition(1).write.format('csv').option("header", "true").save(
        "s3://{}/report-csv".format(bucket_name), mode='overwrite')

How the data appears in the CSV (screenshot omitted): the reason field is split across several columns at each comma.

Any help would really be appreciated. TIA.

NOTE: I think if the field has only commas it exports properly, but the combination of quotes and commas is what is causing the issue.
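
One way to see what is happening is to write a copy with Spark's default CSV options to a local path and inspect the raw output (the local path is just for illustration, and df_csv is the DataFrame above). With the defaults (quote '"', escape '\'), embedded quotes are written as \" rather than the "" doubling that most CSV readers expect, so those readers split the field at the inner commas:

import glob

# Write with the default quote ('"') and escape ('\') settings
df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .save("/tmp/report-csv-default", mode='overwrite')

# Print the raw part file to see the backslash-escaped quotes
for path in glob.glob("/tmp/report-csv-default/part-*.csv"):
    with open(path) as f:
        print(f.read())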

Gunjan Khandelwal

1 Answer


What worked for me:

df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .save("s3://{}/report-csv".format(bucket_name), mode='overwrite')

A more detailed explanation is in this post: Reading csv files with quoted fields containing embedded commas
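
Setting escape to the quote character makes Spark write embedded quotes doubled ("") instead of backslash-escaped, which is the convention most CSV consumers (Excel, pandas, Spark's own reader with matching options) understand. A quick way to verify the round trip, with bucket_name as in the question and the read-side options mirroring the write-side ones:

# Read the file back with matching quote/escape options and confirm
# the reason column survives intact
check = spark.read \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .csv("s3://{}/report-csv".format(bucket_name))
check.show(truncate=False)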

Gunjan Khandelwal