Simply joining each list with commas is not safe: any field that itself contains a comma won't be quoted. For example, ','.join(['a', 'b', '1,2,3', 'c'])
gives you a,b,1,2,3,c
when you'd want a,b,"1,2,3",c.
Instead, use Python's csv module to convert each list in the RDD to a properly formatted csv string:
# python 3
import csv
import io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline

# ... do stuff with your rdd ...
rdd = rdd.map(list_to_csv_str)
rdd.saveAsTextFile("output_directory")
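You can check the helper locally, outside Spark, against the example from the start of this answer:

```python
import csv
import io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline

# The field containing commas is now quoted:
print(list_to_csv_str(['a', 'b', '1,2,3', 'c']))  # a,b,"1,2,3",c
```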
Since the csv module only writes to file-like objects, we create an empty in-memory "file" with io.StringIO("")
and tell the csv.writer to write the csv-formatted string into it. Then output.getvalue()
gives us the string we just wrote to the "file". To make this code work with Python 2, import the StringIO module and use StringIO.StringIO() in place of io.StringIO().
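If the same script needs to run under both interpreters, a try/except around the import is a common pattern. This is a sketch I'm adding beyond the original answer, but the imports shown are the standard ones for each version:

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = StringIO()
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline
```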
If you're using the Spark DataFrames API, you can also look into the Databricks spark-csv package, which adds a csv format to the DataFrame writer.
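As a sketch only (it assumes a running Spark session and an existing DataFrame df, so it won't run standalone):

```python
# With the Databricks spark-csv package on the classpath (Spark 1.x):
df.write.format("com.databricks.spark.csv").option("header", "true").save("output_directory")

# In Spark 2.0+, CSV support is built into the DataFrame writer:
df.write.csv("output_directory", header=True)
```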