I have a DataFrame that looks like this:
scala> data.show
+-----+---+---------+
|label| id| features|
+-----+---+---------+
| 1.0| 1|[1.0,2.0]|
| 0.0| 2|[5.0,6.0]|
| 1.0| 1|[3.0,4.0]|
| 0.0| 2|[7.0,8.0]|
+-----+---+---------+
I want to regroup the features based on "id" so I can get the following:
scala> data.show
+---------+---+-----------------+
| label| id| features |
+---------+---+-----------------+
| 1.0,1.0| 1|[1.0,2.0,3.0,4.0]|
| 0.0,0.0| 2|[5.0,6.0,7.0,8.0]|
+---------+---+-----------------+
This is the code I am using to generate the input DataFrame:
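import org.apache.spark.mllib.linalg.Vectors // or org.apache.spark.ml.linalg.Vectors, depending on the Spark version
val rdd = sc.parallelize(List((1.0, 1, Vectors.dense(1.0, 2.0)), (0.0, 2, Vectors.dense(5.0, 6.0)), (1.0, 1, Vectors.dense(3.0, 4.0)), (0.0, 2, Vectors.dense(7.0, 8.0))))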
val data = rdd.toDF("label", "id", "features")
I have been trying different things with both RDDs and DataFrames. The most "promising" approach so far has been to filter based on "id":
data.filter($"id".equalTo(1))
+-----+---+---------+
|label| id| features|
+-----+---+---------+
| 1.0| 1|[1.0,2.0]|
| 1.0| 1|[3.0,4.0]|
+-----+---+---------+
But I am now stuck on two problems:
1) How can I automate the filtering for every distinct value that "id" could have?
The following generates an error:
data.select("id").distinct.foreach(x => data.filter($"id".equalTo(x)))
2) How can I concatenate the "features" that share a given "id"? I have not tried much here since I am still stuck on 1).
Any suggestion is more than welcome.
Note: For clarification, "label" is always the same for every occurrence of an "id". Sorry for the confusion; a simple extension of my task would be to also group the "labels" (see the updated example).
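To pin down the result I am after, here is a minimal sketch of the regrouping on plain Scala collections, with no Spark involved. It only illustrates the intended semantics and is not a Spark solution. The names RegroupSketch, rows and regroup are made up for this illustration, and Array[Double] stands in for the Spark vector type:

```scala
// Plain Scala collections only; Array[Double] is a stand-in for the Spark vector type.
object RegroupSketch {
  // (label, id, features), mirroring the rows of the example DataFrame
  val rows: List[(Double, Int, Array[Double])] = List(
    (1.0, 1, Array(1.0, 2.0)),
    (0.0, 2, Array(5.0, 6.0)),
    (1.0, 1, Array(3.0, 4.0)),
    (0.0, 2, Array(7.0, 8.0))
  )

  // Group rows by id, then collect the labels and concatenate the features
  // of each group in the order the rows were encountered.
  def regroup(rs: List[(Double, Int, Array[Double])]): List[(List[Double], Int, List[Double])] =
    rs.groupBy(_._2).toList.sortBy(_._1).map { case (id, group) =>
      (group.map(_._1), id, group.flatMap(_._3.toList))
    }

  val regrouped: List[(List[Double], Int, List[Double])] = regroup(rows)
}
```

On a local List the order within each group follows the input order. I assume a distributed version would need extra care if the order of the concatenated features matters.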