I am reading a caret-delimited CSV file with PySpark. It has 5 columns, and I need only 3 of them.
rdd = sc.textFile("test.csv").map(lambda x: x.split("^")).filter(lambda x: len(x) > 1).map(lambda x: (x[0], x[2], x[3]))
print(rdd.take(5))
As shown below, the 4th record has multiline data in its second-to-last column. Because sc.textFile splits the input on newlines, the file's 5 records are read as 6 lines, and the line "multiline^e4" splits into only 2 fields, so x[3] raises an index out of range error.
Data in test.csv:
a1^b1^c1^d1^e1
a2^b2^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4 is
multiline^e4
a5^b5^c5^d5^e5
How can I handle the multiline record while creating the RDD through sc.textFile()?
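For context, I know the broken record could in principle be stitched back together after reading, since every real record has exactly 5 columns (4 carets). A rough sketch of that rejoining logic in plain Python (rejoin_records is a hypothetical helper, not part of any Spark API) would be:

```python
def rejoin_records(lines, n_cols=5, delim="^"):
    """Merge physical lines back into logical records by counting
    delimiters: a line belongs to the previous record as long as the
    accumulated record has fewer than n_cols - 1 delimiters."""
    records, buf = [], ""
    for line in lines:
        buf = line if not buf else buf + "\n" + line
        if buf.count(delim) >= n_cols - 1:
            records.append(buf)
            buf = ""
    if buf:  # trailing partial record, if any
        records.append(buf)
    return records

lines = [
    "a1^b1^c1^d1^e1",
    "a4^b4^c4^d4 is",
    "multiline^e4",
]
print(rejoin_records(lines))
# The two broken lines come back as one record containing a newline.
```

But this runs on a plain list of lines, not distributed, so I am looking for a way to get the same effect while creating the RDD itself.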