
I have a Spark DataFrame that looks as follows; it is filled with sparse vectors, not dense vectors:

+---+--------+-----+-------------+
|id |catagery|index|vec          |
+---+--------+-----+-------------+
|a  |ii      |3.0  |(5,[3],[1.0])|
|a  |ll      |0.0  |(5,[0],[1.0])|
|b  |dd      |4.0  |(5,[4],[1.0])|
|b  |kk      |2.0  |(5,[2],[1.0])|
|b  |gg      |5.0  |(5,[],[])    |
|e  |hh      |1.0  |(5,[1],[1.0])|
+---+--------+-----+-------------+

As we all know, if I sum the numeric index column like this,

    val rr = result.groupBy("id").agg(sum("index"))
    rr.show(false)

I get:

  +---+----------+                                                                
  |id |sum(index)|
  +---+----------+
  |e  |1.0       |
  |b  |11.0      |
  |a  |3.0       |
  +---+----------+

But how can I use groupBy and agg to sum the sparse vectors instead? I want the final DataFrame to look like this:

      +---+-------------------+
      |id |vecResult          |
      +---+-------------------+
      |a  |(5,[0,3],[1.0,1.0])|
      |b  |(5,[2,4],[1.0,1.0])|
      |e  |(5,[1],[1.0])      |
      +---+-------------------+

(For id b, the row gg has the zero vector (5,[],[]), so it contributes nothing to the sum.)

I thought VectorAssembler() might solve this, but I don't know how to write the code. Should I use a udf?
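Here is a rough sketch of what I have in mind (sumVectors is my own hypothetical helper, not a built-in): collect each group's vectors with collect_list, add them element-wise in a udf, and convert the result back to sparse.

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions._

    // Hypothetical helper udf: element-wise sum of a group's vectors,
    // returned as a sparse vector of the same size.
    val sumVectors = udf { (vecs: Seq[Vector]) =>
      val acc = new Array[Double](vecs.head.size)
      // foreachActive visits only the non-zero entries, so the empty
      // vector (5,[],[]) simply contributes nothing.
      vecs.foreach(_.foreachActive((i, x) => acc(i) += x))
      Vectors.dense(acc).toSparse: Vector
    }

    val vecResult = result
      .groupBy("id")
      .agg(sumVectors(collect_list(col("vec"))).as("vecResult"))

    vecResult.show(false)

Is something like this the right approach, or is there a more idiomatic aggregation for vector columns?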
