
Below is an example of the CSV file I'm working on:

life id,policy id,benefit id,date of commencment,status
xx_0,0,0,11/11/2017,active
xx_0,0,0,12/12/2017,active
axb_0,1,0,10/01/2015,active
axb_0,1,0,11/10/2014,active
fxa_2,0,1,01/02/203,active

What I want to do is group the data by (life id + policy id + benefit id), sort each group by the date, and then take the most recent (last) element of each group to run some controls on it.

What's the best way to do this in Spark?

philantrovert

1 Answer


The best way to do it in Spark is probably to use DataFrames (see How to select the first row of each group?, and the sketch after the RDD example below). However, I read that you want to avoid using them, so a pure RDD solution could be written as follows:

import java.text.SimpleDateFormat

val rdd = sc.parallelize(Seq("xx_0,0,0,11/11/2017,active",
    "xx_0,0,0,12/12/2017,active",
    "axb_0,1,0,10/01/2015,active",
    "axb_0,1,0,11/10/2014,active",
    "fxa_2,0,1,01/02/203,active"))

rdd
    .map(_.split(","))
    // key = "life id,policy id,benefit id", value = (timestamp in ms, status)
    .map(x => x.slice(0, 3).reduce(_ + "," + _) ->
        (new SimpleDateFormat("dd/MM/yyyy").parse(x(3)).getTime, x(4)))
    // for each key, keep the record with the most recent date
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    // rebuild CSV-like lines: key, timestamp, status
    .map(x => x._1 + "," + x._2._1 + "," + x._2._2)
    .collect.foreach(println)
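
Note that reduceByKey keeps only the winning record per key instead of shuffling whole groups the way groupByKey would. The output lines carry the parsed date as epoch milliseconds; it can be formatted back into a readable date with SimpleDateFormat#format if needed.

For completeness, the DataFrame approach from the linked question could look roughly like this. It is a minimal sketch, assuming a SparkSession named spark, that the CSV shown in the question is available at the hypothetical path "input.csv" with its header row, and using the header's column names as-is:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, to_date}

// Read the CSV with its header and parse the date column (dd/MM/yyyy).
val df = spark.read
  .option("header", "true")
  .csv("input.csv")
  .withColumn("date", to_date(col("date of commencment"), "dd/MM/yyyy"))

// Rank the rows within each (life id, policy id, benefit id) group,
// most recent date first, and keep only the top row of each group.
val w = Window
  .partitionBy("life id", "policy id", "benefit id")
  .orderBy(col("date").desc)

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show()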
Oli