What my data looks like
I run certain numerical simulations in numba.
The output is a set of numpy arrays; each array represents a metric, and each array has shape (periods x items). E.g. metric_1[p,i] tells me the value of metric_1 at time p for item i.
Each item belongs to a certain category - let's say 'a' and 'b' just for the sake of an example. The one-dimensional array categories tells me exactly this - e.g. categories[0]='a' means the first item belongs to category 'a'. Conceptually, this is like having a pandas multi-index "flattened" into another array.
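To make that mental model concrete, this is roughly the pandas structure I have in mind (illustrative only - I never actually build this dataframe, and the toy sizes are made up):

import numpy as np
import pandas as pd

metric_1 = np.arange(12).reshape(3, 4)       # 3 periods x 4 items
categories = np.array(['a', 'a', 'b', 'b'])  # category of each item

# the equivalent "un-flattened" view: columns keyed by (category, item)
cols = pd.MultiIndex.from_arrays([categories, np.arange(4)],
                                 names=['category', 'item'])
df = pd.DataFrame(metric_1, columns=cols)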
What I am trying to do
I want to group by category and create the array metric_1_grouped of dimensions (periods x categories), etc. I then want to create one dataframe per category, plus one with the sum across all categories, where each row is a period and each column is a metric.
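As a toy example of the output I am after (hypothetical names; 3 periods, 4 items, 2 categories):

import numpy as np

metric_1 = np.array([[1., 2., 3., 4.],
                     [5., 6., 7., 8.],
                     [9., 10., 11., 12.]])   # (3 periods x 4 items)
categories = np.array(['a', 'a', 'b', 'b'])

# desired result, (periods x categories): sum the columns of each category
# metric_1_grouped = [[ 3.,  7.],
#                     [11., 15.],
#                     [19., 23.]]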
The problem itself is fairly banal, but my question is: what is a good way to do this as efficiently as possible, given that I have to do it many times? A typical case would be:
- 300 periods
- 12 metrics
- 500,000 items
- 6 categories
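For scale: at float64, a single metric array is 300 × 500,000 × 8 bytes ≈ 1.2 GB, so the 12 metrics together come to roughly 14 GB per pass - hence my concern with efficiency.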
Why I think this question is not a duplicate
I am aware of a few questions asking if there is a groupby equivalent in numpy, e.g. Is there any numpy group by function? but those are different: they group by the values of the array itself. That is not what I am trying to do - I need to group the columns of my array by looking up each column's position in a separate categories array.
There are some questions which mention summing based on positions but, if I have understood them correctly, they do not resemble my case, e.g. Map numpy array and sum values on positions in another array and sum array with condition in another array with numpy
Please, please, pretty please, do not close this question unless you are sure it has been answered elsewhere - there are many questions which sound similar but are not. Thank you.
Potential solutions?
- pandas dataframes with a multi-index - but I'm afraid it might be much slower (a rough sketch of what I mean is below)
- itertools.groupby? I admit I am not very familiar with it
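For the pandas idea, this is the kind of thing I have in mind - an untested sketch (names illustrative), which transposes so that items become rows, groups the rows by the categories array, sums within each group, and transposes back:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
metric_1 = rng.normal(size=(300, 1000))            # (periods x items)
categories = np.repeat(['a', 'b', 'c', 'd'], 250)  # one label per item

# shape (periods x categories); columns are the category labels
metric_1_grouped = pd.DataFrame(metric_1).T.groupby(categories).sum().T

But I don't know whether the intermediate transposes and copies make this slower than staying in numpy.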
What I have tried
The code below works, but it is inelegant and kind of clunky. I am hoping there is a better / more elegant / faster version?
import numpy as np
import pandas as pd

num_periods = 300
num_items = 1000

# Let's suppose for simplicity that the data has already been sorted by category
categories = np.empty(num_items, dtype=object)
categories[0:100] = 'a'
categories[100:300] = 'b'
categories[300:600] = 'c'
categories[600:] = 'd'

rng = np.random.default_rng(seed=42)  # setting a seed for reproducibility
metric_1 = rng.normal(0, 1, (num_periods, num_items))
metric_2 = rng.uniform(0, 1, (num_periods, num_items))

unique_categories = np.unique(categories)
num_categories = len(unique_categories)

# The second item of the tuple returned by np.unique is an array with the
# index of the first occurrence of each category (which is why the data must
# already be sorted - this is a requirement): here [0, 100, 300, 600].
# Dropping the first element gives the split points: [100, 300, 600]
where_to_split = np.unique(categories, return_index=True)[1][1:]

metric_1_list = np.split(metric_1, where_to_split, axis=1)
metric_1_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_1_list)):
    metric_1_by_category[:, i] = metric_1_list[i].sum(axis=1)

metric_2_list = np.split(metric_2, where_to_split, axis=1)
metric_2_by_category = np.zeros((num_periods, num_categories))
for i in range(len(metric_2_list)):
    metric_2_by_category[:, i] = metric_2_list[i].sum(axis=1)

# We now create a dictionary of dataframes:
# df_by_cat['a'] will be the dataframe for category 'a', etc.
df_by_cat = {}
for my_count, my_val in enumerate(unique_categories):
    df_by_cat[my_val] = pd.DataFrame(index=np.arange(0, num_periods),
                                     columns=['metric 1', 'metric 2'])
    df_by_cat[my_val]['metric 1'] = metric_1_by_category[:, my_count]
    df_by_cat[my_val]['metric 2'] = metric_2_by_category[:, my_count]
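One alternative I have been wondering about, building on the variables above (untested sketch; like my code it relies on the items already being sorted by category, because np.add.reduceat sums contiguous runs of columns):

# start index of each contiguous category block: [0, 100, 300, 600]
boundaries = np.concatenate(([0], where_to_split))
# one call per metric instead of split + loop; result is (periods x categories)
metric_1_by_category_alt = np.add.reduceat(metric_1, boundaries, axis=1)

This would replace the np.split loops, but I have not benchmarked it against my version.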