GroupBy and Concatenate rows of DataFrame for Apache Spark in Java

Question

I have a DataFrame with this schema:

id      user        keywords
1       u1, u2      key1, key2  
1       u3, u4      key3, key4
1       u5, u6      key5, key6
2       u7, u8      key7, key8
2       u9, u10     key9, key10
3       u11, u12    key11, key12
3       u13, u14    key13, key14

I need a method to group Rows by id and concatenate the strings in user and keywords columns, to make it look like this:

id      user                            keywords
1       u1, u2, u3, u4, u5, u6          key1, key2, key3, key4, key5, key6
2       u7, u8, u9, u10                 key7, key8, key9, key10
3       u11, u12, u13, u14              key11, key12, key13, key14

How do I do that in Java?

What have you tried? On this site you should ask about problems you encounter, not for a finished solution to a task... — mgaido, Jun 01 '16 at 09:28
I've been trying to work with a JavaRDD, converting it to a JavaPairRDD and applying reduceByKey and aggregation, but with no success. I thought there might be a better solution that could be applied directly to a DataFrame, but I don't know how. — Sparkan, Jun 01 '16 at 09:31
I am not sure. I am having trouble understanding the suggested Python solution. It seems UserDefinedAggregateFunction does not exist in Spark 1.6.1 in Java. — Sparkan, Jun 01 '16 at 09:40

score 0 · Answer 1 · answered Jun 01 '16 at 09:48

Do something like:

create an RDD of (id, (user, keywords)) pairs
call groupByKey on this RDD
do some flatMap on user and keywords to concatenate the grouped values
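
The steps above can be sketched in plain Java, without Spark, to show the shape of the computation: key each row by id, then merge the user and keywords strings of rows that share a key. This is only an illustration under assumptions. The class GroupConcat, the Row type, and the helper names are hypothetical, and the sample data is copied from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch of "key by id, then merge values per key".
class GroupConcat {

    /** One input row: an id plus two comma-separated string columns. */
    static final class Row {
        final int id;
        final String user;
        final String keywords;

        Row(int id, String user, String keywords) {
            this.id = id;
            this.user = user;
            this.keywords = keywords;
        }
    }

    /**
     * Groups rows by id. Each result value is a two-element array:
     * index 0 holds the concatenated users, index 1 the concatenated keywords.
     */
    static Map<Integer, String[]> groupAndConcat(List<Row> rows) {
        // LinkedHashMap keeps ids in first-seen order.
        Map<Integer, String[]> merged = new LinkedHashMap<>();
        for (Row r : rows) {
            // First row for an id stores the pair; later rows are appended to it.
            merged.merge(r.id, new String[] {r.user, r.keywords},
                    (a, b) -> new String[] {a[0] + ", " + b[0], a[1] + ", " + b[1]});
        }
        return merged;
    }

    /** Sample data copied from the question. */
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, "u1, u2", "key1, key2"));
        rows.add(new Row(1, "u3, u4", "key3, key4"));
        rows.add(new Row(1, "u5, u6", "key5, key6"));
        rows.add(new Row(2, "u7, u8", "key7, key8"));
        rows.add(new Row(2, "u9, u10", "key9, key10"));
        rows.add(new Row(3, "u11, u12", "key11, key12"));
        rows.add(new Row(3, "u13, u14", "key13, key14"));
        return rows;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String[]> e : groupAndConcat(sampleRows()).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]);
        }
    }
}
```

On a JavaPairRDD the same pairing and merging would roughly correspond to mapToPair followed by reduceByKey with a merge function like the lambda above. Note that Spark does not guarantee the order in which values for a key are merged, so the concatenation order may differ from the row order.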

Benjamin

3,350
4
24
49

GroupBy and Concatenate rows of DataFrame for Apache Spark in Java

1 Answers1