I have a .tsv file, pageviews_by_second, consisting of the fields timestamp, site and requests:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
"2015-03-16T00:19:39" "desktop" 2460
I want to remove the first (header) row, because it causes errors in the operations I have to perform on the data.
I tried doing it in the following ways:
1. Filtering the RDD before splitting it
val RDD1 = sc.textFile("pageviews_by_second")
val top_row = RDD1.first()
//returns: top_row: String = "timestamp" "site" "requests"
val RDD2 = RDD1.filter(x => x!= top_row)
RDD2.first()
//returns: "2015-03-16T00:09:55" "mobile" 1595
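If I reproduce the same comparison in plain Scala (outside Spark), it behaves the same way, which I assume is why method 1 works: Strings compare by value, so the whole header line is filtered out.

```scala
// Plain-Scala sketch: String != compares contents, so comparing whole
// lines against the header line removes it.
val top_row = "\"timestamp\"\t\"site\"\t\"requests\""
val lines = List(
  "\"timestamp\"\t\"site\"\t\"requests\"",
  "\"2015-03-16T00:09:55\"\t\"mobile\"\t1595"
)
val rest = lines.filter(x => x != top_row)
println(rest.head) // the data row, header gone
```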
2. Filtering the RDD after splitting it
val RDD1 = sc.textFile("pageviews_by_second").map(_.split("\t"))
RDD1.first() //returns res0: Array[String] = Array("timestamp", "site", "requests")
val top_row = RDD1.first()
val RDD2 = RDD1.filter(x => x!= top_row)
RDD2.first() //returns: res1: Array[String] = Array("timestamp", "site" ,"requests")
val RDD2 = RDD1.filter(x => x(0)!="timestamp" && x(1)!="site" && x(2)!="requests")
RDD2.first() //returns: res1: Array[String] = Array("timestamp", "site" ,"requests")
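Two things I can reproduce in a plain Scala REPL (no Spark involved), which I suspect are relevant here: split leaves the quote characters inside each field, and Arrays compare by reference rather than by contents.

```scala
val line = "\"timestamp\"\t\"site\"\t\"requests\""
val fields = line.split("\t")

// The quotes from the file are still part of each field after split,
// so fields(0) includes the quote characters, not the bare word.
assert(fields(0) == "\"timestamp\"")
assert(fields(0) != "timestamp")

// Arrays are Java arrays underneath: == and != compare references,
// so two arrays with identical contents are still "not equal".
val other = line.split("\t")
assert(fields != other)
assert(fields.sameElements(other))
```

If that is what's happening, it would explain why both filters in method 2 leave the header in place.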
3. Converting into a DataFrame using a case class and then filtering it
case class Wiki(timestamp: String, site: String, requests: String)
val DF = sc.textFile("pageviews_by_second").map(_.split("\t")).map(w => Wiki(w(0), w(1), w(2))).toDF()
val top_row = DF.first()
//returns: top_row: org.apache.spark.sql.Row = ["timestamp","site","requests"]
DF.filter(_ => _ != top_row)
//returns: error: missing parameter type
val DF2 = DF.filter(_ => _ != top_row)
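My guess at the syntax problem, shown as a plain-Scala sketch outside Spark: `_ => _ != top_row` is parsed as a function that ignores its argument and returns *another* anonymous function, `_ != top_row`, whose parameter type the compiler can't infer, hence "missing parameter type". Naming the parameter once avoids the nested underscore:

```scala
val top_row = "header"
val rows = List("header", "row1", "row2")

// `_ => _ != top_row` nests two anonymous functions; the inner
// underscore has no inferable type. Naming the parameter compiles:
val rest = rows.filter(x => x != top_row)
println(rest) // List(row1, row2)
```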
Why is only the first method able to filter out the header row, while the other two aren't? In method 3, why do I get the error, and how can I rectify it?