
Here is my dataframe (shown as a screenshot in the original question):

The base RDD backing this dataframe is zipped with an index. I'd like to split this original dataframe into multiple dataframes, where the delimiter is a row whose first-column string matches a marker (e.g. "GCKN" in this case).

I assume that once I have the individual dataframes, I can also combine the other columns, for example:

A                                                    F     G
GCKN:GCKN_cppr0/in:GCKN_cppr0/out:GCKN_cppr15/in..  -71    531

Is this possible? What is the best way to do this?

  • Can a Spark aggregate function be used here? I'm still figuring out the semantics. Please let me know if anyone has tried. – user1384205 Jun 24 '16 at 06:42

1 Answer


Thanks to https://stackoverflow.com/a/32750733/1384205, I was able to achieve this by mapping over the RDD and then aggregating with the GroupConcat UDAF from that answer.

First, add a row id: a counter that increments each time a delimiter row is found, so that all rows in the same group share the same id.

       .map(x => {
          // start a new group id at each delimiter row, so the
          // delimiter row lands in the same group as the rows after it
          if (x.startsWith("GCKN,")) cnt += 1
          // cnt is a var declared outside the closure; this is only
          // reliable when the RDD has a single partition
          (cnt + "," + x)
        })

followed by

val eprGroupedDF1 = eprDF1
  .groupBy("sIndex")
  .agg(GroupConcat(eprDF1.col("A")).alias("A"), sum("B").alias("B"))
  .sort("sIndex")
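`GroupConcat` here is the user-defined aggregate from the linked answer. As an alternative sketch (an assumption on my part, not the author's code; requires Spark 1.6+), the same concatenation can be done with the built-in `collect_list` and `concat_ws` functions, avoiding a custom UDAF:

```scala
import org.apache.spark.sql.functions._

// Equivalent grouping using only built-in aggregate functions:
// collect_list gathers the A values of each group, concat_ws
// joins them with ":" as in the desired output.
val eprGroupedDF1 = eprDF1
  .groupBy("sIndex")
  .agg(
    concat_ws(":", collect_list(col("A"))).alias("A"),
    sum("B").alias("B")
  )
  .sort("sIndex")
```

Note that `collect_list` does not guarantee element order after a shuffle, so if the original row order within a group matters, it is safer to carry the zipped index along and sort on it before concatenating.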