How to remove header by using filter function in spark?

Question

I want to remove the header from a file. Since the file will be split into partitions, I can't just drop the first item, so I am using a filter function instead. Here is the code I am using:

val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));

The error I am getting says "error: not found: value line". What could be the issue with this code?

Did you check this question? If yes, how is yours different? https://stackoverflow.com/questions/27854919/how-do-i-skip-a-header-from-csv-files-in-spark – FurryMachine, Jul 20 '18 at 16:42
Yes, I did. My header is not a standard schema; it is just another row, but its fields define the different categories. – Vin, Jul 20 '18 at 16:55
I'm not sure I understand your explanation. Would you mind revising it, please? Add an example with some input and the expected output. – eliasah, Jul 20 '18 at 19:30

thebluephantom · Accepted Answer · 2018-07-22T08:08:44.217

3

I don't think anybody has answered the obvious: line.contains is also possible:

val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))

You were nearly there; it was just a syntax issue, but a significant one, of course!
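
For anyone who wants to try this end to end, here is a minimal sketch. It assumes a local SparkSession, and the sample rows, the object name RemoveHeaderByFilter and the helper isHeader are made up for illustration; only the REPORTDATETIME marker comes from the question.

```scala
import org.apache.spark.sql.SparkSession

object RemoveHeaderByFilter {
  // Pure predicate (hypothetical helper): true for lines that contain the header marker.
  def isHeader(line: String): Boolean = line.contains("REPORTDATETIME")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("remove-header")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample data standing in for baseRDD, spread over 2 partitions.
    val baseRDD = sc.parallelize(Seq(
      "REPORTDATETIME,CATEGORY,VALUE",
      "2018-07-20 10:00,A,1",
      "2018-07-20 11:00,B,2"
    ), numSlices = 2)

    // The filter runs on every partition, so it does not matter where the header ends up.
    val noHeaderRDD = baseRDD.filter(line => !isHeader(line))
    noHeaderRDD.collect().foreach(println)

    spark.stop()
  }
}
```

Keep in mind that contains drops every line holding that text, so a data row that happens to include REPORTDATETIME would be removed as well.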

edited Jul 22 '18 at 08:08

answered Jul 22 '18 at 08:03

thebluephantom


score 0 · Answer 2 · answered Jul 20 '18 at 16:52


Using textFile as below:

val rdd = sc.textFile(path) // path: location of your input file
val noHeader = rdd.filter(x => !x.startsWith(headerText)) // headerText: text the header line starts with

Or

In Spark 2.0:

spark.read.option("header","true").csv("filePath")
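
A rough sketch of the second option, assuming a local SparkSession and a small made-up CSV written to a temp file (the object name ReadCsvWithHeader and the sample contents are hypothetical):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ReadCsvWithHeader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-header")
      .getOrCreate()

    // Made-up sample file so the example is self-contained.
    val path = Files.createTempFile("report", ".csv")
    Files.write(path, "REPORTDATETIME,CATEGORY\n2018-07-20 10:00,A\n".getBytes("UTF-8"))

    // header=true uses the first line as column names instead of treating it as data.
    val df = spark.read.option("header", "true").csv(path.toString)
    df.printSchema()
    println(df.count())

    // Converting back gives an RDD[Row], not the RDD[String] that textFile returns.
    df.rdd.take(1).foreach(println)

    spark.stop()
  }
}
```

Note that this returns a DataFrame rather than an RDD of lines, and without further options such as inferSchema every column is read as a string.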

answered Jul 20 '18 at 16:52

1pluszara


What if I have already loaded the data from the file into an RDD and now want to create another RDD that takes part of the data and removes the header from it? – Vin Jul 20 '18 at 17:00

Provide your sample input file and expected output – 1pluszara Jul 20 '18 at 17:25
Your last option with spark.read as stated will have some side effects that you do not mention – thebluephantom Jul 22 '18 at 08:15
