I have a complicated windowing operation in PySpark which I need help with.
I have some data grouped by src and dest, and I need to do the following operations for each group:
- keep only the rows whose socket2 value does not appear in the socket1 column of any row in the group
- after applying that filter, sum the amounts column
amounts  src  dest  socket1  socket2
10       1    2     A        B
11       1    2     B        C
12       1    2     C        D
510      1    2     C        D
550      1    2     B        C
500      1    2     A        B
80       1    3     A        B
And I want to aggregate it in the following way: for src=1, dest=2, only the two rows with socket2=D survive the filter (D never appears in socket1 for that group), so 12+510 = 522; and 80 is the only surviving record for src=1 and dest=3.
amounts  src  dest
522      1    2
80       1    3
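To make the intended logic concrete, here is the aggregation written out in plain Python (not PySpark) on the sample data above, as a sanity check of the result I expect:

```python
from collections import defaultdict

# Sample data: (amounts, src, dest, socket1, socket2)
rows = [
    (10, 1, 2, "A", "B"),
    (11, 1, 2, "B", "C"),
    (12, 1, 2, "C", "D"),
    (510, 1, 2, "C", "D"),
    (550, 1, 2, "B", "C"),
    (500, 1, 2, "A", "B"),
]
rows.append((80, 1, 3, "A", "B"))

# Collect the set of socket1 values seen in each (src, dest) group.
socket1_by_group = defaultdict(set)
for amounts, src, dest, s1, s2 in rows:
    socket1_by_group[(src, dest)].add(s1)

# Keep rows whose socket2 is absent from the group's socket1 set, then sum.
totals = defaultdict(int)
for amounts, src, dest, s1, s2 in rows:
    if s2 not in socket1_by_group[(src, dest)]:
        totals[(src, dest)] += amounts

print(dict(totals))  # {(1, 2): 522, (1, 3): 80}
```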
I borrowed the sample data from here: How to write Pyspark UDAF on multiple columns?