
Here is my dataframe (shown as a screenshot in the original question):

The base RDD backing this dataframe is zipped with an index. I'd like to split this original dataframe into multiple dataframes, where the delimiter is a row whose first-column string matches a marker (e.g. "GCKN" in this case).

I assume that once I have the individual dataframes, I can also combine the other columns, for example:

A                                                    F     G
GCKN:GCKN_cppr0/in:GCKN_cppr0/out:GCKN_cppr15/in..  -71    531

Is this possible? What is the best way to do this?

  • Can a Spark aggregate function be used here? I'm still figuring out the semantics. Please let me know if anyone has tried. – user1384205 Jun 24 '16 at 06:42

1 Answer


Thanks to https://stackoverflow.com/a/32750733/1384205, I was able to achieve this by mapping over the RDD and then aggregating with the GroupConcat UDAF from that answer.

First, add a row id: a counter that increments each time a delimiter row is found, so that all rows in the same group share the same id.

       .map(x => {
          // start a new group id at each delimiter row, so the
          // delimiter row lands in the same group as the rows after it
          if (x.startsWith("GCKN,")) cnt += 1
          // cnt is a var declared outside the closure; this is only
          // reliable when the RDD has a single partition
          (cnt + "," + x)
        })

followed by

val eprGroupedDF1 = eprDF1
  .groupBy("sIndex")
  .agg(GroupConcat(eprDF1.col("A")).alias("A"), sum("B").alias("B"))
  .sort("sIndex")
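`GroupConcat` here is the user-defined aggregate from the linked answer. As an alternative sketch (an assumption on my part, not the author's code; requires Spark 1.6+), the same concatenation can be done with the built-in `collect_list` and `concat_ws` functions, avoiding a custom UDAF:

```scala
import org.apache.spark.sql.functions._

// Equivalent grouping using only built-in aggregate functions:
// collect_list gathers the A values of each group, concat_ws
// joins them with ":" as in the desired output.
val eprGroupedDF1 = eprDF1
  .groupBy("sIndex")
  .agg(
    concat_ws(":", collect_list(col("A"))).alias("A"),
    sum("B").alias("B")
  )
  .sort("sIndex")
```

Note that `collect_list` does not guarantee element order after a shuffle, so if the original row order within a group matters, it is safer to carry the zipped index along and sort on it before concatenating.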