
I have a PySpark DataFrame that has commas in one of its fields. Sample data:

+--------+------------------------------------------------------------------------------------+
|id      |reason                                                                              |
+--------+------------------------------------------------------------------------------------+
|123-8aab|Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!      |
|125-5afs|Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.|
+--------+------------------------------------------------------------------------------------+
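
For context, the sample data above can be recreated with something like the following; the DataFrame name df_csv and the SparkSession setup are assumptions, chosen to match the write snippet below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quote-demo").getOrCreate()

# Hypothetical reconstruction of the sample rows shown above
df_csv = spark.createDataFrame(
    [
        ("123-8aab", 'Request for the added "Hello Ruth, How are, you, doing and Other" abc. Thanks!'),
        ("125-5afs", 'Hi Prachi, I added an "XYZA", like josa.coam, row. "Uid to be eligible" for clarity.'),
    ],
    ["id", "reason"],
)
df_csv.show(truncate=False)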

When I write this out as CSV, the data spills over into the next column and is not represented correctly. The code I am using to write the data:

df_csv.repartition(1).write.format('csv').option("header", "true").save(
        "s3://{}/report-csv".format(bucket_name), mode='overwrite')

How the data appears in the CSV (screenshot omitted): the reason field is split across several columns at each comma.

Any help would really be appreciated. TIA.

NOTE: I think if the field has only commas it exports properly, but the combination of quotes and commas is what is causing the issue.
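
One way to see what is happening is to write a copy with Spark's default CSV options to a local path and inspect the raw output (the local path is just for illustration, and df_csv is the DataFrame above). With the defaults (quote '"', escape '\'), embedded quotes are written as \" rather than the "" doubling that most CSV readers expect, so those readers split the field at the inner commas:

import glob

# Write with the default quote ('"') and escape ('\') settings
df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .save("/tmp/report-csv-default", mode='overwrite')

# Print the raw part file to see the backslash-escaped quotes
for path in glob.glob("/tmp/report-csv-default/part-*.csv"):
    with open(path) as f:
        print(f.read())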

Gunjan Khandelwal

1 Answer


What worked for me:

df_csv.repartition(1).write.format('csv') \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .save("s3://{}/report-csv".format(bucket_name), mode='overwrite')

A more detailed explanation is in this post: Reading csv files with quoted fields containing embedded commas
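
Setting escape to the quote character makes Spark write embedded quotes doubled ("") instead of backslash-escaped, which is the convention most CSV consumers (Excel, pandas, Spark's own reader with matching options) understand. A quick way to verify the round trip, with bucket_name as in the question and the read-side options mirroring the write-side ones:

# Read the file back with matching quote/escape options and confirm
# the reason column survives intact
check = spark.read \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .csv("s3://{}/report-csv".format(bucket_name))
check.show(truncate=False)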

Gunjan Khandelwal