I am working with a really large array of size (>10 million x 2) that is in the following format:
a = [[1, 1.3],
     [5, 56.3],
     [6, 6.4],
     [12, 18],
     ...]
type(a) = numpy.ndarray
Basically, column 1 is the index and column 2 is the associated value.
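For the snippets below, here is a tiny runnable stand-in for the real array (the real a is read from HDF5, see further down):

import numpy as np

# toy stand-in for the real (>10 million x 2) array
a = np.array([[1, 1.3],
              [5, 56.3],
              [6, 6.4],
              [12, 18.0]])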
Now, I would like to group this array based on lists of indices I provide. Each group is just a simple list/array of indices:
g1 = [6,1,...]
g2 = [12,5,...]
Notice that the indices within the groups are not in the same order as the indices in my giant array. I could also have more than two groups.
Eventually, my goal is to sum up the values for the indices in each provided group. But right now, I am just trying to do the grouping without aging a few years while my code runs.
The result of the intermediate step should be
a1 = [[6, 6.4],
[1, 1.3]]
a2 = [[12, 18],
[5, 56.3]]
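To make the intermediate step concrete, this is the naive loop version that produces exactly those arrays on the toy a above (it assumes every group index actually occurs in a; g1 and g2 below are truncated stand-ins for my real groups). It is fine for a toy, hopeless at my scale:

import numpy as np

g1 = [6, 1]    # truncated stand-ins for the real groups, for illustration
g2 = [12, 5]

# naive per-group lookup: scans all of `a` once for every group index
def group_rows(a, g):
    return np.array([a[a[:, 0] == i][0] for i in g])

a1 = group_rows(a, g1)
a2 = group_rows(a, g2)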
Finally, the easy part: summing up the values (col 2) for all the indices in each group gives a1_sum = 7.7 and a2_sum = 74.3.
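Given a1 and a2, the sums really are one line each:

a1_sum = a1[:, 1].sum()   # 7.7
a2_sum = a2[:, 1].sum()   # 74.3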
I am reading a from an HDF5 file. From my previous experience -- which is not a whole lot -- working from HDF5 directly is slower, so I just copy the entire array into memory and work with that. This approach was perfectly fine when my data was smaller, ~500k rows. With this large array, making the copy takes about 10 s. I am okay with that, but then trying to do what I explained above is just painfully slow.
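For context, the copy is just the standard h5py read-everything pattern (the file and dataset names below are placeholders, not my real ones):

import h5py

# placeholder names; the real file/dataset differ
with h5py.File("data.h5", "r") as f:
    a = f["my_dataset"][:]    # [:] materializes the whole dataset as a NumPy array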
As you would expect, for-looping over >10 million rows of the array will take whatever time I have left on this planet, and a for loop is the only way I can think of to accomplish this: I would have to loop through each entry in col_1 of a, compare it against every list to find out which list that index belongs to, and then group. I'm having a really hard time coming up with a way to vectorize this, since vectorized NumPy operations are so much faster than loops.
The end result I need is just the sums, so perhaps there is a way to get them without grouping at all?
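For instance, I'm wondering whether something along these lines (an untested np.isin sketch) would give the sums directly from a boolean mask, without ever building a1 and a2:

import numpy as np

# one boolean mask per group; keep rows whose index column is in the group
a1_sum = a[np.isin(a[:, 0], g1), 1].sum()
a2_sum = a[np.isin(a[:, 0], g2), 1].sum()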
I would appreciate any help you can provide. Thank you!