I have a csv file containing commas within a column value. For example,
Column1,Column2,Column3
123,"45,6",789
The values are wrapped in double quotes when they contain commas. In the above example the values are Column1=123, Column2=45,6 and Column3=789. But when I try to read the data, I get four values because of the extra comma in the Column2 field.
How do I get the right values when reading this data in PySpark? I am using Spark 1.6.3.
I am currently doing the below to create an RDD and then a DataFrame from the RDD:
rdd = sc.textFile(input_file).map(lambda line: line.split(','))
df = sqlContext.createDataFrame(rdd)
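For what it's worth, I can reproduce the problem outside Spark: a plain `split(',')` breaks the quoted field into two tokens, while Python's built-in `csv` module respects the double quotes on a single line. So I'm wondering whether mapping `csv.reader` over the RDD (a sketch of what I have in mind, not something I've confirmed is the idiomatic approach) is the right way to go:

```python
import csv

line = '123,"45,6",789'

# naive split on every comma - yields 4 tokens
naive = line.split(',')
# naive == ['123', '"45', '6"', '789']

# csv module honours the double-quote escaping - yields 3 fields
parsed = next(csv.reader([line]))
# parsed == ['123', '45,6', '789']

# the same idea mapped over the RDD would look something like:
# rdd = sc.textFile(input_file).mapPartitions(csv.reader)
```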