-2

I want to remove the header from a file. Since the file will be split into partitions, I can't just drop the first item, so I am using a filter function instead. Here is the code I am using:

val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));

The error I am getting says "error not found value line". What could be the issue with this code?

thebluephantom
Vin
  • Did you check this question? If yes, how is yours different? https://stackoverflow.com/questions/27854919/how-do-i-skip-a-header-from-csv-files-in-spark – FurryMachine Jul 20 '18 at 16:42
  • Yes, I did. Actually, my header is not a standard schema; it is just another row, but its fields define the different categories. – Vin Jul 20 '18 at 16:55
  • I'm not sure I understand your explanation. Would you mind revising it? Please add an example with some input and the expected output. – eliasah Jul 20 '18 at 19:30

2 Answers

3

I don't think anybody has stated the obvious: using contains on the line is also possible:

val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))

You were nearly there; it was just a syntax issue, but of course a significant one!
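
For comparison, here is a minimal sketch of the filter applied to an RDD that spans more than one partition. It assumes a local SparkSession and made-up sample lines; only the REPORTDATETIME token and the names baseRDD and noHeaderRDD come from the question, while the object name, the isDataLine helper and the data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DropHeaderByFilter {
  // Hypothetical helper: true for every line that does not contain the header token.
  def isDataLine(line: String): Boolean = !line.contains("REPORTDATETIME")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `sc` already exists.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("drop-header-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample data, deliberately spread over two partitions.
    val baseRDD = sc.parallelize(Seq(
      "REPORTDATETIME,CATEGORY,VALUE",
      "2018-07-20 16:42,A,1",
      "2018-07-20 16:55,B,2"
    ), numSlices = 2)

    // Dot notation and infix notation express the same predicate.
    val noHeaderRDD = baseRDD.filter(line => !line.contains("REPORTDATETIME"))
    val infixStyle  = baseRDD.filter(line => !(line contains "REPORTDATETIME"))

    // Both filters keep exactly the same lines.
    assert(noHeaderRDD.collect().sameElements(infixStyle.collect()))
    noHeaderRDD.collect().foreach(println)

    spark.stop()
  }
}
```

Because filter evaluates each line on its own, it does not matter which partition the header ends up in. The trade-off is that any data line that happens to contain the token is dropped as well.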

thebluephantom
0

Using textFile as below:

val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))

Or

In Spark 2.0:

spark.read.option("header","true").csv("filePath")
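
As a rough illustration of the reader-based variant, the sketch below writes a small temporary CSV file and reads it back with the header option. The file contents, the object name and the writeSample helper are hypothetical; a local SparkSession is assumed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.SparkSession

object DropHeaderWithCsvReader {
  // Made-up sample file: one header row followed by two data rows.
  val sampleCsv: String =
    "REPORTDATETIME,CATEGORY,VALUE\n" +
    "2018-07-20 16:42,A,1\n" +
    "2018-07-20 16:55,B,2\n"

  // Hypothetical helper that writes the sample to a temporary file.
  def writeSample(): Path = {
    val path = Files.createTempFile("report", ".csv")
    Files.write(path, sampleCsv.getBytes(StandardCharsets.UTF_8))
    path
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("csv-header-sketch")
      .getOrCreate()

    val path = writeSample()

    // With header=true the first line becomes the column names
    // instead of being returned as a data row.
    val df = spark.read.option("header", "true").csv(path.toString)

    assert(df.columns.sameElements(Array("REPORTDATETIME", "CATEGORY", "VALUE")))
    assert(df.count() == 2)
    df.show()

    spark.stop()
  }
}
```

Note that this yields a DataFrame rather than an RDD of strings, so it fits best when the header row really does name the columns.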
1pluszara