
I am working with a really large array (size >10 million x 2) that is in the following format:

a = [[1, 1.3],
     [5, 56.3],
     [6, 6.4],
     [12, 18],
     .
     . 
     .]

Here `type(a)` is `numpy.ndarray`. Basically, column 1 is the index and column 2 is the associated value. Now, I would like to group this array based on lists of indices I provide.
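
For reference, here is a runnable miniature of the data using just the four rows shown above (the real array has >10 million rows):

    import numpy as np

    # Miniature of the real (>10 million row) array: col 1 holds integer
    # indices, col 2 the associated float values. Mixing ints and floats
    # means NumPy stores everything as float64.
    a = np.array([[1, 1.3],
                  [5, 56.3],
                  [6, 6.4],
                  [12, 18]])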

The groups are just simple lists/arrays of indices:

g1 = [6,1,...]
g2 = [12,5,...]

Notice that the indices within a group are not in the same order as the indices in my giant array. I could also have more than two groups.

Eventually, my goal is to sum up the values for the indices in each provided group. But right now, I am just trying to do the grouping without aging a few years while my code runs.

The result of the intermediate step should be:

a1 = [[6, 6.4],
      [1, 1.3]]
a2 = [[12, 18],
      [5, 56.3]]

Finally, the easy part: summing up the values (col 2) for all the indices in each group gives `a1_sum = 7.7` and `a2_sum = 74.3`.
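
In other words, once a group is extracted, its sum is just a column sum (assuming `a1` and `a2` are NumPy arrays like `a`):

    a1_sum = a1[:, 1].sum()   # 7.7
    a2_sum = a2[:, 1].sum()   # 74.3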

I am reading `a` from an HDF5 file. From my previous experience -- which is not a whole lot -- working from HDF5 directly is slower, so I just copy the entire array into memory and work with it. This approach was perfectly fine when my data was smaller (~500k rows). With this large array, making the copy takes about 10 s. I am okay with that, but trying to do what I explained above is just painfully slow.
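
For context, my read step looks roughly like this (the file and node names here are placeholders, not my real ones):

    import tables

    # Placeholder names: "data.h5" and /mydata stand in for my real file/node.
    with tables.open_file("data.h5", mode="r") as h5:
        a = h5.root.mydata.read()   # copies the whole dataset into memory (~10 s)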

As you would expect, for-looping over >10 million rows of the array will take whatever time I have left on this planet, and a for loop is the only way I can think of to accomplish this: loop through each entry in col 1 of `a`, compare it against all the lists to find out which list the index belongs to, and then group (see the sketch below). I'm having a really hard time coming up with ways to vectorize my approach, since vectorizing in NumPy is much faster than looping.
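
Spelled out, the naive version I keep arriving at is something like this:

    # Naive grouping: for every row, scan every group list for a match.
    # With >10M rows this nested scan is hopelessly slow.
    g1 = [6, 1]                       # example groups from above (minus the "...")
    g2 = [12, 5]
    groups = [g1, g2]
    grouped = [[] for _ in groups]
    for idx, val in a:                # Python-level loop over every row
        for k, g in enumerate(groups):
            if idx in g:              # linear scan of a plain Python list
                grouped[k].append([idx, val])
                break
    # Note: rows land in a's order, not the group's order,
    # but that doesn't matter for the sums.
    a1 = np.array(grouped[0])
    a2 = np.array(grouped[1])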

The end result I need is only the sum per group, so perhaps there is a way to achieve that without grouping at all?
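
For example, could something along these lines work? (Untested sketch on my part; I have no idea whether `np.isin` holds up at this scale.)

    # Untested idea: boolean-mask each group's rows directly, no Python loop.
    for g in (g1, g2):
        mask = np.isin(a[:, 0], g)    # True for rows whose index is in g
        print(a[mask, 1].sum())       # prints 7.7, then 74.3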

I would appreciate any help you can provide. Thank you!

Sushi
  • Aren't `a[g1]` and `a[g2]` what you are looking for? If you only want the sum and nothing else, then this is not the best solution, but at least it is much faster than a loop. – Jérôme Richard Dec 12 '22 at 18:06
  • @JérômeRichard `g1` and `g2` are lists of column-one values -- not positional list/array indices. Sorry if I didn't make that clear in my post. – Sushi Dec 12 '22 at 18:36
  • @JérômeRichard The array I am working with consists of integer indices (col 1) and float values (col 2), which is why I was referring to the entries in `g1` and `g2` as "indices" -- but they are really just integers that are a subset of col 1, not positional index values. Hopefully that clears it up. Sorry for the confusion again. – Sushi Dec 12 '22 at 18:47
  • Does this answer your question? [How do I Pandas group-by to get sum?](https://stackoverflow.com/questions/39922986/how-do-i-pandas-group-by-to-get-sum) – Homer512 Dec 12 '22 at 18:53
  • @Homer512 The solution presented there is only valid for a Pandas DataFrame, right? I am working with NumPy arrays. – Sushi Dec 12 '22 at 19:30
  • You can drop numpy arrays into pandas frames in one line – Homer512 Dec 12 '22 at 19:49
  • If Pandas is the only problem, then it is certainly wise to read previous posts about that. For example, https://stackoverflow.com/questions/49141969 and https://stackoverflow.com/questions/48686381. Are these posts useful? – Jérôme Richard Dec 12 '22 at 19:51
  • Thank you, both! Those previous posts are leading me down the right path and I'm making progress. – Sushi Dec 13 '22 at 17:11
  • Since you are reading from an HDF5 file, you should consider PyTables. If the data is in a Table object (not an Array object), PyTables can quickly query and return the desired array of indices and values. There are several (recent) answers that show how to use `Table.get_where_list()` and `Table.read_coordinates()` together to accomplish this task. – kcw78 Dec 13 '22 at 20:16
  • @kcw78 I use PyTables to read the HDF5 file, but I typically convert it to a NumPy array right away and work with that, which sounds inefficient and most likely is. So far my data size wasn't bad, so I didn't have speed issues and I was comfortable working with NumPy only. Now I'm realizing how inefficient my methods were. Thanks for pointing me to those two methods. I will check 'em out. – Sushi Dec 13 '22 at 23:31
  • Here is an answer that shows how to use the 2 functions together: [What is the best way to query a pytable column with many values?](https://stackoverflow.com/a/74468943/10462884) – kcw78 Dec 14 '22 at 13:43
