Calculate rolling sum of array in PySpark and save as dict?

Question

Given an input like this:

timestamp     vars 
2             [1,2,3]
2             [1,2,4]
3             [1,2]
4             [1,3]
5             [1,3]

I need to keep a rolling count of each of the indices. I tried expanding each array into a one-hot encoding ([1,2,3,5] -> [0,1,1,1,0,1]) and summing the vectors, but the vector length can get arbitrarily large (> 1 million), so I want to maintain the counts as a dict instead, something like the output below. Any pointers would be greatly appreciated.

timestamp     vars 
2             {1:1, 2:1, 3:1}
2             {1:2, 2:2, 3:1, 4:1}
3             {1:3, 2:3, 3:1, 4:1}
4             {1:4, 2:3, 3:2, 4:1}
5             {1:5, 2:3, 3:3, 4:1}

Thanks!

I would suggest maintaining a hashmap, incrementing the counts there, and reading from it when creating the dictionary. — ashwin agrawal, Feb 26 '20 at 01:28

score 0 · Answer 1 · answered Feb 26 '20 at 11:01

Sample DataFrame:

+---+------------+
| ID|         arr|
+---+------------+
|  1|         [0]|
|  2|      [0, 1]|
|  3|   [0, 1, 2]|
|  4|[0, 1, 2, 3]|
|  1|         [0]|
|  1|         [0]|
|  3|   [0, 1, 2]|
|  0|          []|
+---+------------+

Use the following function, which relies on collections.Counter:

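from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, MapType

def arr_operation(arr):
    return dict(Counter(arr))  # per-row counts, e.g. [0, 0, 1] -> {0: 2, 1: 1}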

Create a UDF for the arr_operation function as follows:

udf_dist_count = udf(arr_operation, MapType(IntegerType(), IntegerType()))

Then call the UDF to create a new column:

final_df = df.withColumn("Dict", udf_dist_count("arr"))

The result looks like this:

+---+------------+--------------------------------+
|ID |arr         |Dict                            |
+---+------------+--------------------------------+
|1  |[0]         |[0 -> 1]                        |
|2  |[0, 1]      |[0 -> 1, 1 -> 1]                |
|3  |[0, 1, 2]   |[0 -> 1, 1 -> 1, 2 -> 1]        |
|4  |[0, 1, 2, 3]|[0 -> 1, 1 -> 1, 2 -> 1, 3 -> 1]|
|1  |[0]         |[0 -> 1]                        |
|1  |[0]         |[0 -> 1]                        |
|3  |[0, 1, 2]   |[0 -> 1, 1 -> 1, 2 -> 1]        |
|0  |[]          |[]                              |
+---+------------+--------------------------------+

Why Counter can be slow, particularly in a distributed environment, is explained well in the answers to the question "Why is Collections.counter so slow?"

score -1 · Answer 2 · answered Feb 26 '20 at 01:31

I would suggest Counter from collections:

In [1]: from collections import Counter

In [2]: count = Counter()

In [3]: count.update([1,2,4])

In [4]: count
Out[4]: Counter({1: 1, 2: 1, 4: 1})

In [5]: count.update([1,2,3])

In [6]: count
Out[6]: Counter({1: 2, 2: 2, 4: 1, 3: 1})

In [7]: count.update([2,3,5])

In [8]: count
Out[8]: Counter({1: 2, 2: 3, 4: 1, 3: 2, 5: 1})
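
Applied to the input from the question, the same idea might look like the sketch below. It is plain Python running on a single machine, not a distributed Spark solution, and it assumes the rows are already ordered by timestamp. The names rolling_counts and rows are made up for illustration.

```python
from collections import Counter

def rolling_counts(rows):
    """Yield (timestamp, dict) pairs with cumulative counts up to each row."""
    running = Counter()
    for timestamp, values in rows:
        running.update(values)          # add this row's values to the totals
        yield timestamp, dict(running)  # copy, so later updates don't alter it

# Input from the question, assumed to be already ordered by timestamp.
rows = [(2, [1, 2, 3]), (2, [1, 2, 4]), (3, [1, 2]), (4, [1, 3]), (5, [1, 3])]
result = list(rolling_counts(rows))
for timestamp, counts in result:
    print(timestamp, counts)
```

Each yielded dict is a snapshot, so its size grows only with the number of distinct values seen so far rather than with the largest index.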

My suggestion is to use Counter. Does PySpark not include python's standard library? — salt-die, Feb 26 '20 at 02:03
@salt-die Yes, it includes the library, but Counter will not work in a distributed manner. Your suggestion would have to run inside a UDF, and UDFs are extremely slow for big-data tasks. — murtihash, Feb 26 '20 at 02:11
You can create new Counters by adding to count if you don't want to mess with mutable objects. — salt-die, Feb 26 '20 at 02:15
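
A minimal sketch of that non-mutating variant, using the same three lists as the answer. It assumes everything fits in memory on one machine; the names batches and snapshots are illustrative only.

```python
from collections import Counter
from itertools import accumulate

batches = [[1, 2, 4], [1, 2, 3], [2, 3, 5]]

# accumulate() combines items with + by default; Counter + Counter
# returns a new Counter and leaves both operands unchanged.
snapshots = list(accumulate(Counter(batch) for batch in batches))

print(snapshots[0])
print(snapshots[-1])
```

Because every step produces a fresh Counter, earlier snapshots stay valid after later ones are computed. Note that Counter addition keeps only positive counts, which makes no difference here since all counts are positive.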
