
I am reading a CSV file through PySpark. It is a caret-delimited file with 5 columns, and I need only 3 of them.

rdd = sc.textFile("test.csv").map(lambda x: x.split("^")).filter(lambda x: len(x)>1).map(lambda x: (x[0], x[2], x[3]))

print(rdd.take(5))

As shown below, the 4th record in the file has multiline data in its second-to-last column. Because of this, even though the file has only 5 records, Spark treats it as 6, and I get an index out of range error.

Data in file.csv:

a1^b1^c1^d1^e1
a2^b2^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4 is 
multiline^e4
a5^b5^c5^d5^e5

How can I enable multiline handling while creating the RDD through sc.textFile()?

Sri
  • I have seen examples online that enable multiline when creating a DataFrame, e.g. `spark.read.option("multiLine", "true").csv('file.csv')`, but I could not find anything equivalent for sc.textFile() – Sri Nov 18 '18 at 07:05
  • You mean you want to read just 5 columns from your text file? – Ali AzG Nov 18 '18 at 07:11
  • Let's say, as described below, there are 5 columns in a file with only 4 records, and I am reading only the second-to-last column. The last record has a multiline value in that column, which is why I am getting the error: `a1^b1^c1^d1^e1 a2^b2^c2^d2^e2 a3^b3^c3^d3^e3 a4^b4^c4^d4 is very lengthy^e4` – Sri Nov 18 '18 at 07:24
  • @Sri - your question isn't clear. Can you please [edit] your question and update it with a sample of the input, the expected output, the code you are running, and the code's output? – Yaron Nov 18 '18 at 08:56
  • @Yaron I modified the question; please let me know if I can add more details so that you can respond to it. – Sri Nov 18 '18 at 10:31
  • Possible duplicate of [How to read whole file in one string](https://stackoverflow.com/questions/30445263/how-to-read-whole-file-in-one-string) – 10465355 Nov 18 '18 at 10:49
  • @user10465355 wholeTextFiles is different from textFile; here is a link about the difference: https://stackoverflow.com/questions/47129950/spark-textfile-vs-wholetextfiles. The issue I am asking about is textFile. – Sri Nov 18 '18 at 11:25
  • @Sri - Which version of spark are you using? – Yaron Nov 18 '18 at 12:07
  • @Sri - why do you wish to work with `sc.textFile` / `rdd` instead of using `spark.read.option("multiLine", "true").csv('file.csv')`? – Yaron Nov 18 '18 at 12:08
  • @Yaron This would work only with quoted strings, wouldn't it? – 10465355 Nov 18 '18 at 12:32
  • @Sri `textFile` doesn't read multiple lines (it reads data only line-by-line [for some unambiguous `delimiter`](https://stackoverflow.com/q/31227363/10465355)). That's what `wholeTextFiles` is for. – 10465355 Nov 18 '18 at 12:34
  • @Yaron the reason for going to sc.textFile is that the CSV file actually has 110 columns. If I choose `spark.read.csv()` I have to create a schema with `StructType` for all 110 columns. To avoid that, I am choosing `sc.textFile()` and then loading only a few columns by index. – Sri Nov 18 '18 at 18:41

2 Answers


From my analysis I found that this cannot be done through sc.textFile(). The reason is that as soon as the file is loaded into an RDD, each line of the file becomes a separate element of that RDD. At that point the lines of a multiline value have already been split into separate records, so the original records cannot be recovered through sc.textFile() alone.
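
A workaround along the lines of the wholeTextFiles suggestion in the comments: read the whole file as one string, stitch the continuation lines back onto their records in plain Python, and only then parallelize. This is a minimal sketch, assuming every complete record has exactly 5 caret-separated fields and that the file is small enough for the driver (wholeTextFiles returns each file as a single value):

# Read the entire file as a single (path, content) pair and keep the content.
content = sc.wholeTextFiles("test.csv").values().first()

# Merge continuation lines: a record is complete once it has 5 fields.
records = []
for line in content.splitlines():
    if records and len(records[-1].split("^")) < 5:
        records[-1] += "\n" + line   # previous record is incomplete, glue this line on
    else:
        records.append(line)

rdd = sc.parallelize(records) \
        .map(lambda x: x.split("^")) \
        .map(lambda x: (x[0], x[2], x[3]))
print(rdd.take(5))

Since the stitching happens on the driver, this is only suitable for files that fit in driver memory; larger files would need a partition-aware approach.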

Sri
from pyspark.sql.session import SparkSession

spark = SparkSession(sc)
# Read the caret-delimited file into a DataFrame; multiLine=True lets
# Spark keep records whose quoted fields span several lines together.
df = spark.read.csv("csv.csv", multiLine=True, header=False, sep="^", escape='"')
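
Note that, as pointed out in the comments, multiLine only helps when the multiline values are quoted in the file. As for the 110-column concern from the comments: when no schema is supplied, spark.read.csv assigns default column names _c0, _c1, ..., so a few columns can still be selected by position without building a StructType. A small usage sketch, assuming the df from above:

# Select the 1st, 3rd and 4th columns by their default positional names.
df.select("_c0", "_c2", "_c3").show()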
Val
    While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Alessio Sep 20 '19 at 13:32