I have a csv file containing commas within a column value. For example,
Column1,Column2,Column3
123,"45,6",789
The values are wrapped in double quotes when they contain commas. In the above example the values are Column1=123, Column2=45,6 and Column3=789. But when I try to read the data, I get four values because of the extra comma in the Column2 field.
How do I get the right values when reading this data in PySpark? I am using Spark 1.6.3.
I am currently doing the below to create an RDD and then a DataFrame from the RDD:
rdd = sc.textFile(input_file).map(lambda line: line.split(','))
df = sqlContext.createDataFrame(rdd)
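For what it's worth, I can reproduce the problem outside Spark: a plain `split(',')` breaks the quoted field into two tokens, while Python's built-in `csv` module respects the double quotes on a single line. So I'm wondering whether mapping `csv.reader` over the RDD (a sketch of what I have in mind, not something I've confirmed is the idiomatic approach) is the right way to go:

```python
import csv

line = '123,"45,6",789'

# naive split on every comma - yields 4 tokens
naive = line.split(',')
# naive == ['123', '"45', '6"', '789']

# csv module honours the double-quote escaping - yields 3 fields
parsed = next(csv.reader([line]))
# parsed == ['123', '45,6', '789']

# the same idea mapped over the RDD would look something like:
# rdd = sc.textFile(input_file).mapPartitions(csv.reader)
```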