You don't need to create an RDD; you can filter the data using the DataFrame itself.
Suppose you have the DataFrame below:
+-------+-----------+
|country| year|
+-------+-----------+
| India| 07-01-2009|
| USA| 07-01-2010|
| USA| 01-01-2008|
| India| 07-01-2010|
| Canada| 07-01-2009|
| Canada| 02-03-2018|
+-------+-----------+
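For reference, this sample DataFrame can be built directly from the values shown above (a minimal sketch; column names are taken from the printed output):
import spark.implicits._
val df = Seq(
  ("India", "07-01-2009"),
  ("USA", "07-01-2010"),
  ("USA", "01-01-2008"),
  ("India", "07-01-2010"),
  ("Canada", "07-01-2009"),
  ("Canada", "02-03-2018")
).toDF("country", "year")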
Create one more column, filter_year. Note that substring is 1-based, so after trimming any stray whitespace the 4-digit year starts at position 7:
import org.apache.spark.sql.functions._
val newdf = df.withColumn("filter_year", substring(trim(df.col("year")), 7, 4))
+-------+-----------+-----------+
|country| year|filter_year|
+-------+-----------+-----------+
| India| 07-01-2009| 2009|
| USA| 07-01-2010| 2010|
| USA| 01-01-2008| 2008|
| India| 07-01-2010| 2010|
| Canada| 07-01-2009| 2009|
| Canada| 02-03-2018| 2018|
+-------+-----------+-----------+
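If you prefer not to count character positions, the same helper column can also be derived with Spark's built-in date functions (a sketch, assuming the dates follow a dd-MM-yyyy pattern; adjust the format string to your data):
val newdf2 = df.withColumn("filter_year", year(to_date(trim(df.col("year")), "dd-MM-yyyy")))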
Now apply the filter condition and drop the newly added column:
val ans = newdf.filter("filter_year in (2009,2010)").drop("filter_year")
+-------+-----------+
|country| year|
+-------+-----------+
| India| 07-01-2009|
| USA| 07-01-2010|
| India| 07-01-2010|
| Canada| 07-01-2009|
+-------+-----------+
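The intermediate column is only for convenience; the same result can also be obtained in one step by filtering on the expression directly (a sketch reusing the substring logic above):
val ans2 = df.filter(substring(trim(df.col("year")), 7, 4).isin("2009", "2010"))
Here isin does the same membership test as the SQL in (...) clause, with the values compared as strings since no cast is applied.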
If instead you have an RDD of the given data, you can do it like below:
val rdd = spark.read.format("csv").option("header", "true").option("delimiter", ",").load("C:\\spark\\programs\\temp.csv").rdd
The RDD will look like this:
Array[org.apache.spark.sql.Row] = Array([India, 07-01-2009], [USA, 07-01-2010], [USA, 01-01-2008], [India, 07-01-2010], [Canada, 07-01-2009], [Canada, 02-03-2018])
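For reference, temp.csv is assumed to look something like this (a header row to match the "header" option, one record per line):
country,year
India,07-01-2009
USA,07-01-2010
USA,01-01-2008
India,07-01-2010
Canada,07-01-2009
Canada,02-03-2018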
You only need to write the filter line below carefully for your dataset; row(1) is the year column, and splitting on "-" takes the last (year) part:
val yearList = List(2009, 2010)
rdd.filter(row => yearList.contains(row(1).toString.trim.split("-")(2).toInt)).collect
You will get your desired output, like below:
Array[org.apache.spark.sql.Row] = Array([India, 07-01-2009], [USA, 07-01-2010], [India, 07-01-2010], [Canada, 07-01-2009])
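If you need the result back as a DataFrame rather than an Array of Rows, the filtered RDD can be converted using the schema of the original read (a sketch reusing the same CSV load; csvDf and filteredDf are illustrative names):
val csvDf = spark.read.format("csv").option("header", "true").load("C:\\spark\\programs\\temp.csv")
val filteredDf = spark.createDataFrame(
  csvDf.rdd.filter(row => yearList.contains(row(1).toString.trim.split("-")(2).toInt)),
  csvDf.schema
)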