
How can I use Scala's filter operation to keep only the records/rows whose year is 2009 or 2010?

Basically, I want to filter the CSV file using the Scala filter operation and pick all the rows that have a date column like the ones below.

07/01/2008
07/01/2009

I'm creating an RDD by reading the data from the CSV, and now I want to filter the data whose year is 2009 or 2010:

val data = spark.read.csv("D:\\data.csv").rdd

Appreciate your help.

Ramesh Maharjan
Anil Kumar
  • did you see examples on the internet of the filter function? almost every tutorial shows how to use filter. – Knows Not Much Jun 16 '18 at 06:32
  • Possible duplicate of [Filtering a spark dataframe based on date](https://stackoverflow.com/questions/31994997/filtering-a-spark-dataframe-based-on-date) – Ramesh Maharjan Jun 16 '18 at 06:39

1 Answer


You don't need to create an RDD; you can filter the data using the DataFrame itself.

Suppose you have the DataFrame below:

+-------+-----------+
|country|       year|
+-------+-----------+
|  India| 07-01-2009|
|    USA| 07-01-2010|
|    USA| 01-01-2008|
|  India| 07-01-2010|
| Canada| 07-01-2009|
| Canada| 02-03-2018|
+-------+-----------+

Create one more column, filter_year:

import org.apache.spark.sql.functions.substring

val newdf = df.withColumn("filter_year", substring(df.col("year"), 8, 10))

+-------+-----------+-----------+
|country|       year|filter_year|
+-------+-----------+-----------+
|  India| 07-01-2009|       2009|
|    USA| 07-01-2010|       2010|
|    USA| 01-01-2008|       2008|
|  India| 07-01-2010|       2010|
| Canada| 07-01-2009|       2009|
| Canada| 02-03-2018|       2018|
+-------+-----------+-----------+
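Note that substring(df.col("year"), 8, 10) only lands on the year if each value carries one extra leading character (for example a space left over from the delimiter) before the dd-MM-yyyy string; on a plain 10-character value it would return "009". A position-independent variant, as a sketch assuming the year is always the last four characters, would be:

import org.apache.spark.sql.functions.substring

// A negative start position counts from the end of the string, so this
// picks the last four characters regardless of any leading whitespace.
val newdf = df.withColumn("filter_year", substring(df.col("year"), -4, 4))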

Now apply the filter condition and drop the newly added column:

val ans=newdf.filter("filter_year in (2009,2010)").drop("filter_year")

+-------+-----------+
|country|       year|
+-------+-----------+
|  India| 07-01-2009|
|    USA| 07-01-2010|
|  India| 07-01-2010|
| Canada| 07-01-2009|
+-------+-----------+
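If you would rather treat the column as a real date than as a string, here is a minimal alternative sketch (assuming the values follow the dd-MM-yyyy pattern shown above and may carry leading whitespace; ans2 and parsed are just illustrative names):

import org.apache.spark.sql.functions.{col, to_date, trim, year}

// Parse the string into a proper DateType column, filter on the
// extracted year, then drop the helper column again.
val ans2 = df
  .withColumn("parsed", to_date(trim(col("year")), "dd-MM-yyyy"))
  .filter(year(col("parsed")).isin(2009, 2010))
  .drop("parsed")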

If you have an RDD of the given data, then you can do it like below:

val rdd=spark.read.format("csv").option("header","true").option("delimiter",",").load("C:\\spark\\programs\\temp.csv").rdd

The RDD will look like this:

 Array[org.apache.spark.sql.Row] = Array([India, 07-01-2009], [USA, 07-01-2010], [USA, 01-01-2008], [India, 07-01-2010], [Canada, 07-01-2009], [Canada, 02-03-2018])

You only need to write the line below carefully for your dataset:

 val yearList = List(2009, 2010)

 rdd.filter(row => yearList.contains(row(1).toString.trim.split("-")(2).toInt)).collect

You will get your desired output, like below:

Array[org.apache.spark.sql.Row] = Array([India, 07-01-2009], [USA, 07-01-2010], [India, 07-01-2010], [Canada, 07-01-2009])
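As a follow-up, a slightly more defensive variant of that filter (just a sketch; safeFiltered is an illustrative name) skips rows whose date field is null or not in the expected dd-MM-yyyy shape instead of throwing on toInt:

import scala.util.Try

val safeFiltered = rdd.filter { row =>
  Option(row.getString(1))                        // guard against null dates
    .map(_.trim.split("-"))
    .filter(_.length == 3)                        // expect exactly day-month-year
    .flatMap(parts => Try(parts(2).toInt).toOption)
    .exists(yearList.contains)
}
safeFiltered.collect().foreach(println)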
Manoj Kumar Dhakad