
What my data looks like

I run certain numerical simulations in numba.

The output is a set of numpy arrays; each array represents a metric, and each array has shape (periods x items).

E.g. metric_1[p,i] tells me the value of metric_1 at time p for item i.

Each item belongs to a certain category - let's say 'a' and 'b' just for the sake of an example. The one-dimensional array categories tells me exactly this - e.g. categories[0]='a' means the first item belongs to category 'a'. Conceptually, this is like having a pandas multi-index "flattened" into another array.

What I am trying to do

  1. I want to group by category, and create the array metric_1_grouped of dimensions (periods x categories), etc.

  2. I want to create one dataframe per category, and one with the sum of all categories, where each row is a period and each column is a metric

The problem itself is fairly banal, but my question is what is a good way to do this as efficiently as possible, since I have to do this many times? A typical case would be:

  • 300 periods
  • 12 metrics
  • 500,000 items
  • 6 categories

Why I think this question is not a duplicate

I am aware of a few questions asking if there is a groupby equivalent in numpy, e.g. Is there any numpy group by function? but those are different because they all group by an element of the array itself. That is not what I am trying to do - I need to group, yes, but not by any element of the array itself, but rather by matching the column numbers to another array.

There are some questions which mention summing based on positions but, if I have understood them correctly, they do not resemble my case, e.g. Map numpy array and sum values on positions in another array and sum array with condition in another array with numpy

Please, please, pretty please, do not close this question unless you are sure it has been answered elsewhere - there are many questions which sound similar but are not. Thank you.

Potential solutions?

  • pandas dataframes with multi-index - but I'm afraid it might be much slower
  • itertools.groupby? I admit I am not very familiar with it - see the sketch after this list
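For reference, a minimal sketch of the itertools.groupby route (using the variable names from the code further down; it relies on the columns already being sorted by category, and is only an illustration, not a tested solution):

from itertools import groupby

# group consecutive column indices by their category
col_groups = {cat: list(grp)
              for cat, grp in groupby(range(num_items), key=lambda i: categories[i])}
# e.g. metric_1[:, col_groups['a']].sum(axis=1) would give the summed column for category 'a'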

What I have tried - it works, but is inelegant and kind of clunky

The code below works, but I am hoping there is a better / more elegant / faster version.

import numpy as np
import pandas as pd

num_periods = 300
num_items = 1000
# Let's suppose for simplicity that the data has already been sorted by category
categories = np.empty(num_items, dtype=object)
categories[0:100]='a'
categories[100:300]='b'
categories[300:600]='c'
categories[600:]='d'

rng = np.random.default_rng(seed=42) #setting a seed for reproducibility
metric_1 = rng.normal(0,1,(num_periods,num_items))
metric_2 = rng.uniform(0,1,(num_periods,num_items)) 
unique_categories = np.unique(categories)
num_categories=len(unique_categories)


where_to_split = np.unique(categories, return_index=True)[1][1:]
# The second item of the tuple returned by np.unique is an array with the
# index of the first occurrence of each category (which, remember, we had
# already sorted by category - this is a requirement),
# so it will be: [0, 100, 300, 600]
# so where_to_split is the array [100, 300, 600]

metric_1_list = np.split(metric_1, where_to_split, axis=1)
metric_1_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_1_list)):
    metric_1_by_category[:,i] = metric_1_list[i].sum(axis=1)
    
metric_2_list = np.split(metric_2, where_to_split, axis=1)
metric_2_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_2_list)):
    metric_2_by_category[:,i] = metric_2_list[i].sum(axis=1)
    
# we now create a dictionary of dataframes
# df_by_cat['a'] will be the dataframe for category a, etc
df_by_cat = {}
for my_count, my_val in enumerate(unique_categories):
    df_by_cat[my_val] = pd.DataFrame(index = np.arange(0,num_periods), columns=['metric 1','metric 2'])
    df_by_cat[my_val]['metric 1'] = metric_1_by_category[:,my_count]
    df_by_cat[my_val]['metric 2'] = metric_2_by_category[:,my_count]
Pythonista anonymous
  • What's wrong with your code? What is the part you want to optimize? – Corralien May 05 '23 at 22:22
  • Mostly the speed - if at all possible. Then I'm curious if there is a more elegant way – Pythonista anonymous May 05 '23 at 22:32
  • Except using multiprocessing, I don't know how you can optimize your code, since your category arrays are unbalanced so you can't use numpy broadcasting. You can avoid iterating over both metrics separately, though: one loop is sufficient. – Corralien May 05 '23 at 22:35
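A minimal sketch of that single-loop suggestion, reusing where_to_split from the question's code (the metrics dict here is an assumed helper, not part of the original code):

metrics = {'metric 1': metric_1, 'metric 2': metric_2}  # assumed helper dict

by_category = {}
for name, arr in metrics.items():
    # split the columns at the category boundaries and sum each block
    blocks = np.split(arr, where_to_split, axis=1)
    by_category[name] = np.stack([b.sum(axis=1) for b in blocks], axis=1)
# by_category['metric 1'] has shape (num_periods, num_categories)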

2 Answers


I think Pandas' groupby on the unbalanced categories is better than np.split. You can groupby on each metric separately with a for loop, since you only have a relatively small number of them; in that case you don't really need a multi-index. Or you can concatenate all your data into a multi-index dataframe and perform a single groupby, so you end up with one centralized dataframe instead of a list/dict of them.

Let's try the second approach:

import numpy as np
import pandas as pd

## begin sample data
num_periods = 300
num_items = 1000
# Let's suppose for simplicity that the data has already been sorted by category
categories = np.empty(num_items, dtype=object)
categories[0:100]='a'
categories[100:300]='b'
categories[300:600]='c'
categories[600:]='d'

rng = np.random.default_rng(seed=42) #setting a seed for reproducibility
metric_1 = rng.normal(0,1,(num_periods,num_items))
metric_2 = rng.uniform(0,1,(num_periods,num_items)) 

### end sample data

### the metric dict:
metrics = {'metric1': metric_1, 'metric2': metric_2}

out = pd.concat({
    # one (items x periods) frame per metric, each row tagged with its category
    k: pd.DataFrame(v.T).assign(cat=categories) for k, v in metrics.items()
}).set_index('cat', append=True).groupby(level=[0, -1]).sum()

Then the first 5 periods would look like this:

                      0           1           2           3           4
        cat                                                            
metric1 a     -5.026961    5.278172   -3.108761    5.986694    0.229614
        b     -7.296710   -9.827326    5.355780    2.047677   19.013256
        c     -3.375551  -22.063953   22.150701   12.479241   -4.025634
        d    -13.192328  -54.765248    9.303484  -21.972627  -36.574717
metric2 a     54.575700   48.128531   52.650639   49.389781   49.621222
        b    104.931030  105.882775  100.459972  102.027632  100.234287
        c    149.866571  155.756189  150.443303  155.644543  147.322897
        d    210.539154  196.552705  199.897130  201.381321  204.282579

Now, let's say you want the data for metric_1; you can do:

out.loc['metric1']

Then you'd get (for first 5 periods):

             0          1          2          3          4
cat                                                       
a    -5.026961   5.278172  -3.108761   5.986694   0.229614
b    -7.296710  -9.827326   5.355780   2.047677  19.013256
c    -3.375551 -22.063953  22.150701  12.479241  -4.025634
d   -13.192328 -54.765248   9.303484 -21.972627 -36.574717
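
If you still want one (periods x metrics) dataframe per category, as in the question, a possible sketch built on top of out is:

# a sketch: one dataframe per category, rows = periods, columns = metrics
df_by_cat = {cat: out.xs(cat, level='cat').T for cat in np.unique(categories)}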

Update: Since you have relatively many items and only a few metrics, you can just groupby on each metric like:

out = pd.concat({
    k: pd.DataFrame(v).groupby(categories, axis=1).sum()
    for k,v in metrics.items()
})

And then out looks like:

                     a           b           c           d
metric1 0    -5.026961   -7.296710   -3.375551  -13.192328
        1     5.278172   -9.827326  -22.063953  -54.765248
        2    -3.108761    5.355780   22.150701    9.303484
        3     5.986694    2.047677   12.479241  -21.972627
        4     0.229614   19.013256   -4.025634  -36.574717
...                ...         ...         ...         ...
metric2 295  53.144215  106.399000  147.736396  207.730258
        296  48.941017  103.749010  141.037684  193.678342
        297  50.234486  103.213497  157.424556  203.338274
        298  48.845265   99.927642  152.657109  200.892997
        299  43.714031   99.497051  147.830122  197.667950

and data for a specific metric can be:

out.loc['metric1']

Output:

             a          b          c          d
0    -5.026961  -7.296710  -3.375551 -13.192328
1     5.278172  -9.827326 -22.063953 -54.765248
2    -3.108761   5.355780  22.150701   9.303484
3     5.986694   2.047677  12.479241 -21.972627
4     0.229614  19.013256  -4.025634 -36.574717
..         ...        ...        ...        ...
295   8.319109  -0.917488  27.867494   4.127136
296  14.944851 -10.455586  11.772255 -20.975950
297 -13.453457  -4.187631 -21.091565  15.851561
298 -13.933062   3.307818   7.864996 -31.559198
299  -8.041795   9.137808  -4.100661  11.852521
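
Note that the axis=1 argument of groupby is deprecated in more recent pandas versions; an equivalent sketch that avoids it transposes first, groups the rows by category, and transposes back:

out = pd.concat({
    k: pd.DataFrame(v.T).groupby(categories).sum().T  # same result, no axis=1
    for k, v in metrics.items()
})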

Update 2: the numpy groupby functionality would be, in your case, np.add.reduceat:

# notice the missing [1:]: reduceat wants the start index of every group,
# including the first one at 0
where_to_split = np.unique(categories, return_index=True)[1]

out = {k: np.add.reduceat(v, where_to_split, axis=1)
       for k, v in metrics.items()}
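Each value of out is again a (periods x categories) array, with columns in np.unique order; as a sketch, they can be wrapped into labelled dataframes like this:

# a sketch: label the reduceat output with the category names
unique_categories = np.unique(categories)
out_df = {k: pd.DataFrame(v, columns=unique_categories) for k, v in out.items()}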
Quang Hoang

Maybe you can use:

from itertools import zip_longest

dfs = {}
cats, idx = np.unique(categories, return_index=True)
for cat, i, j in zip_longest(cats, idx, idx[1:], fillvalue=num_items):
    data = {'metric 1': metric_1[..., i:j].sum(axis=1),
            'metric 2': metric_2[..., i:j].sum(axis=1)}
    dfs[cat] = pd.DataFrame(data)

Output:

>>> dfs['b']
      metric 1    metric 2
0    -7.296710  104.931030
1    -9.827326  105.882775
2     5.355780  100.459972
3     2.047677  102.027632
4    19.013256  100.234287
..         ...         ...
295  -0.917488  106.399000
296 -10.455586  103.749010
297  -4.187631  103.213497
298   3.307818   99.927642
299   9.137808   99.497051

[300 rows x 2 columns]

You can concatenate all dataframes using pd.concat:

>>> pd.concat(dfs, axis=1)
             a                     b                      c                      d            
      metric 1   metric 2   metric 1    metric 2   metric 1    metric 2   metric 1    metric 2
0    -5.026961  54.575700  -7.296710  104.931030  -3.375551  149.866571 -13.192328  210.539154
1     5.278172  48.128531  -9.827326  105.882775 -22.063953  155.756189 -54.765248  196.552705
2    -3.108761  52.650639   5.355780  100.459972  22.150701  150.443303   9.303484  199.897130
3     5.986694  49.389781   2.047677  102.027632  12.479241  155.644543 -21.972627  201.381321
4     0.229614  49.621222  19.013256  100.234287  -4.025634  147.322897 -36.574717  204.282579
..         ...        ...        ...         ...        ...         ...        ...         ...
295   8.319109  53.144215  -0.917488  106.399000  27.867494  147.736396   4.127136  207.730258
296  14.944851  48.941017 -10.455586  103.749010  11.772255  141.037684 -20.975950  193.678342
297 -13.453457  50.234486  -4.187631  103.213497 -21.091565  157.424556  15.851561  203.338274
298 -13.933062  48.845265   3.307818   99.927642   7.864996  152.657109 -31.559198  200.892997
299  -8.041795  43.714031   9.137808   99.497051  -4.100661  147.830122  11.852521  197.667950

[300 rows x 8 columns]

and play with .T, .stack, pivot, etc to reshape your dataframe.
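
The question also asks for one dataframe with the sum over all categories; since the per-category dataframes share the same index and columns, a minimal sketch is simply to add them up:

# a sketch: total across all categories, same (periods x metrics) layout
df_total = sum(dfs.values())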

Corralien