
I have a CSV file that I want to read into a DataFrame.

Here is an example of my file (the last column may contain strings with spaces):

C1 C2 C3
  1  2 ab cd
 11 12 xx yz
5      6 mm nn pl

I tried to read this file using:

spark.read.csv("myFile", header=True, mode="DROPMALFORMED", sep=' ')

But it fails (all rows are dropped as malformed). I assume this is because sep=' ' treats every single space as a delimiter, so the data rows end up with more fields than the header.

To read this file successfully, I would first need to edit it (strip the extra spaces, replace the spaces in the last column with underscores, etc.):

C1 C2 C3
1 2 ab_cd
11 12 xx_yz
5 6 mm_nn_pl
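
For example, that cleanup could be scripted with something like this (a rough sketch in plain Python, assuming the file fits in memory, that the first two columns never contain spaces, and that "myFile_clean" is just a hypothetical output path):

with open("myFile") as src, open("myFile_clean", "w") as dst:
    for line in src:
        parts = line.split()                  # collapse runs of spaces
        if not parts:
            continue                          # skip blank lines
        c1, c2, rest = parts[0], parts[1], parts[2:]
        dst.write(" ".join([c1, c2, "_".join(rest)]) + "\n")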


Is there a way to read the file into a DataFrame as-is, without changing it?

I also tried using the options ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace, without success:

spark.read.csv("myFile", header=True, mode="DROPMALFORMED", sep=' ', ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True)
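
If there is no built-in option, one fallback I'm considering is to read the raw lines with spark.read.text and pull the columns out with a regular expression (a rough sketch, assuming the first two columns never contain spaces):

from pyspark.sql import functions as F

pattern = r"^\s*(\S+)\s+(\S+)\s+(.*)$"         # C1, C2, then everything else as C3
raw = spark.read.text("myFile")                # single string column named "value"

df = raw.select(
    F.regexp_extract("value", pattern, 1).alias("C1"),
    F.regexp_extract("value", pattern, 2).alias("C2"),
    F.regexp_extract("value", pattern, 3).alias("C3"),
).filter(F.col("C1") != "C1")                  # drop the header line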

Thanks for the help

Nir
  • Can you iterate through the CSV, build a cleaned CSV, and try loading that? It seems you have the following corrections to make: remove leading spaces from each line; C1 and C2 contain text without any separator, and C3 has text with spaces that have to be replaced by underscores. Are there other ways that rows can be invalid in this CSV file? – Haleemur Ali Nov 20 '17 at 14:34
  • Possible duplicate of [Get CSV to Spark dataframe](https://stackoverflow.com/questions/29936156/get-csv-to-spark-dataframe). Read the file as a text file in SparkContext first, transform the lines in your mapper function, then convert to a DataFrame (see the sketch below). – cowbert Nov 20 '17 at 16:08
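
A rough sketch of what that last comment suggests, assuming the first two columns never contain spaces: read the raw lines, split each one into at most three fields, and build the DataFrame from the result.

lines = spark.sparkContext.textFile("myFile")
header = lines.first()
col_names = header.split()                      # ['C1', 'C2', 'C3']

rows = (lines
        .filter(lambda l: l != header)          # drop the header line
        .map(lambda l: l.split(None, 2))        # split on whitespace, at most 3 fields
        .filter(lambda fields: len(fields) == 3))

df = spark.createDataFrame(rows, col_names)
df.show()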
