
Let's say I have the following sparse vector column `features` and an integer column `dt_diff`. I need to group the rows by `cuid` and sum these values.

+-----------------------------------+--------------------------------------+---------------+
|               cuid                |              features                |   dt_diff     |
+-----------------------------------+--------------------------------------+---------------+
| 12654467                          |(2013492,[1743933,2013491],[2.0,2.0]) |      4        |                     
| 12654467                          |(1876451,[1000000,1876451],[5.0,7.0]) |      10       |
+-----------------------------------+--------------------------------------+---------------+

So the output is

+-----------------------------------+--------------------------------------+---------------+
|               cuid                |              features                |   dt_diff     |
+-----------------------------------+--------------------------------------+---------------+
| 12654467                          |(3889943,[2743933,3889942],[7.0,9.0]) |      14       |                    
+-----------------------------------+--------------------------------------+---------------+
    Hi @Jerdy, can you share what you've tried so far? – baitmbarek Dec 09 '19 at 15:02
  • @baitmbarek I've tried this: https://stackoverflow.com/questions/33899977/how-to-define-a-custom-aggregation-function-to-sum-a-column-of-vectors — but in this example the author sets a fixed vector size (3), whereas I need to add the sizes of the vectors. – Jerdy Dec 09 '19 at 15:46
  • What is your features' structure? – baitmbarek Dec 09 '19 at 16:31
  • @baitmbarek Like in my example. I have one more feature column, which I one-hot encoded to get another sparse vector. – Jerdy Dec 09 '19 at 16:42
  • You could call ``agg`` after the ``groupBy`` and in it use ``functions.collect_list(column).as("whatever")`` (separate with commas to add multiple aggregations). Then map over the result with a custom ``MapFunction`` to deal with the collected lists. – Tomás Denis Reyes Sánchez Dec 23 '19 at 16:52
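The per-group reducer implied by the example output can be sketched in plain Scala before wiring it into a Spark UDAF or `mapGroups`. This follows the arithmetic shown literally in the question's tables: the result size is the sum of the input sizes, and indices and values are added element-wise (the names `Sparse` and `mergeSparse` are hypothetical, not part of any Spark API):

```scala
// Minimal stand-in for an ML SparseVector: (size, indices, values).
case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

// Merge two sparse vectors using the semantics shown in the example:
// sizes are summed, and indices/values are added pairwise.
def mergeSparse(a: Sparse, b: Sparse): Sparse = {
  require(a.indices.length == b.indices.length,
    "this sketch assumes equal-length index arrays, as in the example")
  Sparse(
    a.size + b.size,
    a.indices.zip(b.indices).map { case (x, y) => x + y },
    a.values.zip(b.values).map { case (x, y) => x + y }
  )
}

// Reproducing the example rows for cuid 12654467:
val merged = mergeSparse(
  Sparse(2013492, Array(1743933, 2013491), Array(2.0, 2.0)),
  Sparse(1876451, Array(1000000, 1876451), Array(5.0, 7.0))
)
// merged == Sparse(3889943, Array(2743933, 3889942), Array(7.0, 9.0))
```

Inside Spark, this reducer could be folded over the list produced by `groupBy("cuid").agg(collect_list(...))` as the comment above suggests, with `dt_diff` summed in the same `agg`.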
