
Below is an example of the CSV file I'm working on:

life id,policy id,benefit id,date of commencment,status
xx_0,0,0,11/11/2017,active
xx_0,0,0,12/12/2017,active
axb_0,1,0,10/01/2015,active
axb_0,1,0,11/10/2014,active
fxa_2,0,1,01/02/203,active

What I want to do is group the data by (life id + policy id + benefit id), sort each group by the date, and then take the most recent (last) element of each group to run some controls on it.

What's the best way to do this in Spark?

philantrovert

1 Answer


The best way to do it in Spark is probably to use DataFrames (see How to select the first row of each group?, and the sketch after the RDD example below). However, I read that you want to avoid using them, so a pure RDD solution could be written as follows:

import java.text.SimpleDateFormat

val rdd = sc.parallelize(Seq("xx_0,0,0,11/11/2017,active",
    "xx_0,0,0,12/12/2017,active",
    "axb_0,1,0,10/01/2015,active",
    "axb_0,1,0,11/10/2014,active",
    "fxa_2,0,1,01/02/203,active"))

rdd
    .map(_.split(","))
    // key = "life id,policy id,benefit id", value = (timestamp in ms, status)
    .map(x => x.slice(0, 3).reduce(_ + "," + _) ->
        (new SimpleDateFormat("dd/MM/yyyy").parse(x(3)).getTime, x(4)))
    // for each key, keep the record with the most recent date
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    // rebuild CSV-like lines: key, timestamp, status
    .map(x => x._1 + "," + x._2._1 + "," + x._2._2)
    .collect.foreach(println)
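
Note that reduceByKey keeps only the winning record per key instead of shuffling whole groups the way groupByKey would. The output lines carry the parsed date as epoch milliseconds; it can be formatted back into a readable date with SimpleDateFormat#format if needed.

For completeness, the DataFrame approach from the linked question could look roughly like this. It is a minimal sketch, assuming a SparkSession named spark, that the CSV shown in the question is available at the hypothetical path "input.csv" with its header row, and using the header's column names as-is:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, to_date}

// Read the CSV with its header and parse the date column (dd/MM/yyyy).
val df = spark.read
  .option("header", "true")
  .csv("input.csv")
  .withColumn("date", to_date(col("date of commencment"), "dd/MM/yyyy"))

// Rank the rows within each (life id, policy id, benefit id) group,
// most recent date first, and keep only the top row of each group.
val w = Window
  .partitionBy("life id", "policy id", "benefit id")
  .orderBy(col("date").desc)

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show()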
Oli