
I am just getting started with PySpark and would like to save a file as a CSV instead of a text file. I tried a couple of answers I found on Stack Overflow, such as

def toCSVLine(data):
   return ','.join(str(d) for d in data)

and then

 rdd = lines.map(toCSVLine)
 rdd.saveAsTextFile("file.csv")

It works in that I can open the file in Excel, but all the information ends up in column A of the spreadsheet. I would like each field of the RDD (for example, ("ID", "rating")) to go into a separate column in Excel, so that ID is in column A and rating is in column B. Is there a way to do this?

asked by tclaw46, edited by karlson

3 Answers


In Excel, are you splitting the file on the ','?

In Excel, go to the Data tab and select Text to Columns under Data Tools, then select Delimited and hit Next. Then select Comma as the delimiter and hit Finish.

Edit

Generally it is best practice to create a CSV with a separator other than a comma if commas will appear in your data. Per your comment, if you are creating the CSV yourself, just use a different separator (e.g. ';', '|', '^', or tabs). Another option, which I prefer less, is to wrap the field in question in double quotes, like so:

field0,field1,"field,2",field3

Excel should leave what is in quotes alone and only split on commas outside of the quotes. But again this is not my preferred solution.
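
If you control how the lines are built on the PySpark side, one way to get that quoting automatically is to let Python's standard csv module build each line instead of a plain join. A minimal sketch, assuming Python 3 and the lines RDD from the question:

import csv
import io

def toCSVLine(data):
    # csv.writer quotes any field that contains the delimiter,
    # so embedded commas stay inside one field
    buf = io.StringIO()
    csv.writer(buf).writerow(data)
    return buf.getvalue().rstrip('\r\n')

lines.map(toCSVLine).saveAsTextFile("file.csv")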

– Grr
  • I did try this method, but it split one of the columns, which contained a line of text with multiple commas, into multiple columns. Is there a way to use this method while ignoring commas inside quotes? – tclaw46 Feb 22 '17 at 16:06

If you're on Spark >= 2.0 and assuming your RDD has a tabular format (which it should, given you want to save it as CSV), one way might be to first create a DataFrame from the RDD and then use DataFrameWriter to export it to CSV.

from pyspark.sql import SparkSession

# Get (or reuse) a SparkSession; SparkSession.builder is the standard entry point
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(rdd)

df.write.csv("/path/to/file.csv", sep=',', header=True)

Have a look at the pyspark.sql docs for additional options and further information.

– karlson
  • Is there a way to use a comma as a delimiter while ignoring commas inside quotes? – tclaw46 Feb 22 '17 at 16:47
  • Commas inside strings should be fine, as strings should be automatically quoted. Have you tried the options `sep`, `quote`, `escape` etc? – karlson Feb 22 '17 at 17:04
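
For reference, a sketch of the options mentioned in the comment above; quote and escape are parameters of DataFrameWriter.csv in Spark 2.x (the path is just an example):

# quote wraps fields that contain the separator; escape handles quote
# characters that appear inside an already-quoted field
df.write.csv("/path/to/file.csv", sep=',', header=True, quote='"', escape='\\')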

One option is to convert the RDD to a DataFrame and then save it as CSV.

from pyspark.sql import SQLContext

# Assumes an existing SparkContext `sc`, as in the PySpark shell
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ['count', 'word'])
# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')

Please see this post I just made: How to write the resulting RDD to a csv file in Spark python

– Insilico
  • This did work, but is there a way to get it to ignore commas inside quotes? It split up a line of text I need to keep together into multiple columns because it has some commas in it. – tclaw46 Feb 22 '17 at 17:10
  • You can probably use another delimiter with .option("delimiter", "|") and use that delimiter when opening in Excel – Insilico Feb 22 '17 at 19:55
  • Alright I was able to get it to work with this code. Thank you for your help. – tclaw46 Feb 23 '17 at 03:35
  • The `coalesce(1)` here is problematic though. What if your data does not fit into one partition? I think it's smarter to skip the `coalesce` and concat the resulting files using other tools (e.g. `cat` on the commandline, or your standard python file operations) – karlson Nov 20 '17 at 09:37
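
To illustrate the approach in the last comment, a minimal sketch of concatenating the part files with standard Python file operations instead of coalesce(1). The paths and header handling here are assumptions, not part of the original answer:

import glob

# Assumes header='false' on write, so each part file contains only data rows
with open('/home/username/combined.csv', 'w') as out:
    out.write('count,word\n')  # write the header once ourselves
    for part in sorted(glob.glob('/home/username/csv_out/part-*')):
        with open(part) as f:
            out.write(f.read())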