
The issue

I have arrays which track certain items over time. The items belong to certain categories. I want to calculate the sum by time and category, e.g. to go from a table by time and city to one by time and country.

I have found a couple of ways, but they seem clunky - there must be a better way! Surely I'm not the first one with this issue? Maybe using np.where?

More specifically:

I have a number of numpy arrays of shape (p x i), where p is the period and i is the item I am tracking over time. I then have a separate array of shape i which classifies the items into categories (red, green, yellow, etc.).

What I want to do is calculate an array of shape (p x number of unique categories) which sums the values of the big array by time and category. In pictures:

[image: a (p x i) array of values plus an i-length category array being summed into a (p x number of unique categories) array]

The code needs to be as efficient as possible, as I need to do this multiple times on arrays which can be as large as 400 x 1,000,000.
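
To make the shapes concrete, here is a tiny toy example of the transformation I'm after (the values and category names are made up):

import numpy as np

# 2 periods x 4 items
values = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])
categories = np.array(['red', 'green', 'red', 'green'])

# desired result: 2 periods x 2 unique categories (green, red)
# period 0: green = 2 + 4 = 6,  red = 1 + 3 = 4
# period 1: green = 6 + 8 = 14, red = 5 + 7 = 12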

What I have tried:

This question covers a number of ways to groupby without resorting to pandas. I like the scipy.ndimage approach, but AFAIK it works on one dimension only.
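
For reference, this is roughly how I understand the scipy.ndimage approach working on a single row; the snippet below is my own sketch (assuming integer labels have already been built), not code from that answer:

import numpy as np
from scipy import ndimage

row = np.array([1, 2, 3, 4])          # one period, 4 items
codes = np.array([0, 1, 0, 1])        # integer category label for each item
sums = ndimage.sum(row, labels=codes, index=[0, 1])
# sums -> array([4., 6.]), i.e. per-category totals for this single period;
# to cover all periods I would still have to loop over the rows myself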

I have tried a solution with pandas:

  • I create a dataframe of shape periods x items
  • I unpivot it with pd.melt(), join the categories and do a crosstab period/categories

I have also tried a set of loops, optimised with numba:

  • A first loop creates an array which converts the categories into integers, i.e. the first category in alphabetical order becomes 0, the 2nd becomes 1, etc. (see the sketch after this list)
  • A second loop iterates through all the items, then for each item it iterates through all the periods and sums by category
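
As an aside, the integer-coding step from the first loop can also be done outside numba in a single numpy call; this is just a sketch of that idea, not the code I timed:

import numpy as np

categories = np.array(['red', 'green', 'yellow', 'brown', 'red'])
# np.unique sorts the unique values and, with return_inverse=True, returns
# each element's position in that sorted list - the "category -> integer" mapping
cat_uniq, categories_converted = np.unique(categories, return_inverse=True)
# cat_uniq             -> ['brown', 'green', 'red', 'yellow']
# categories_converted -> [2, 1, 3, 0, 2]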

My findings

  • for small arrays, pandas is faster
  • for large arrays, numba is better, but it's better to set parallel = False in the numba decorator
  • for very large arrays, numba with parallel = True shines; parallel = True makes use of numba's parallelisation via numba.prange on the outer loops

[plot: timings of the pandas and numba approaches for increasing array sizes]

PS I am aware of the pitfalls of premature optimisation etc. - I am only looking into this because a significant amount of time is spent doing precisely this.

The code

import numpy as np
import pandas as pd
import time
import numba

periods = 300
n = int(2000)
categories = np.tile(['red','green','yellow','brown'],n)
my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
# my_arrays will have shape (periods x (n * number of categories))


#---- pandas
start = time.time()

df_categories = pd.DataFrame(data = categories).reset_index().rename(columns ={'index':'item',0:'category'})
df = pd.DataFrame(data = my_array)
unpiv = pd.melt(df.reset_index(), id_vars ='index', var_name ='item', value_name ='value').rename( columns = {'index':'time'})
unpiv = pd.merge(unpiv, df_categories, on='item' )
crosstab = pd.crosstab( unpiv['time'], unpiv['category'], values = unpiv['value'], aggfunc='sum' )

print("panda crosstab in:")
print(time.time() - start)
# yep, I know that timeit.timer would have been better, but I was in a hurry :)
print("")


#---- numba
@numba.jit(nopython = True, parallel = True, nogil = True)
def numba_classify(x, categories):
    cat_uniq = np.unique(categories)
    num_categories = len(cat_uniq)
    num_items = x.shape[1]
    periods = x.shape[0]
    categories_converted = np.zeros(len(categories), dtype = np.int32)
    out = np.zeros(( periods, num_categories))
    
    
    # before running the actual classification, I must convert the categories, which can be strings, to
    # the corresponding number in cat_uniq, e.g. if brown is the first category by alphabetical sorting, then
    # brown --> 0, etc
    
    for i in numba.prange(num_items):
        for c in range(num_categories):
            if categories[i] == cat_uniq[c]:
                categories_converted[i] = c
      
        
    for i in numba.prange(num_items):        
        for p in range(periods):
            out[ p, categories_converted[i] ] += x[p,i]


    return out

start = time.time()

numba_out = numba_classify(my_array, categories)
print("numba done in:")
print(time.time() - start)
Pythonista anonymous
  • Have you considered using groupby like `pd.DataFrame(data = my_array, columns=categories).groupby(level=0, axis=1).sum()`? And there is possibly a numpy version of groupby (somewhere on stackoverflow) that could also increase the speed :) – Ben.T Feb 23 '21 at 21:43

1 Answer


You can use df.groupby(categories, axis=1).sum() for a substantial speedup.

import numpy as np
import pandas as pd
import time


def make_data(periods, n):
    categories = np.tile(['red','green','yellow','brown'],n)
    my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
    
    return categories, pd.DataFrame(my_array)

for n in (200, 2000, 20000):
    categories, df = make_data(300, n)
    true_n = n * 4
    
    start = time.time()
    tabulation = df.groupby(categories, axis=1).sum()
    elapsed = time.time() - start
    
    print(f"300 x {true_n:5}: {elapsed:.3f} seconds")

# prints:
300 x   800: 0.005 seconds
300 x  8000: 0.021 seconds
300 x 80000: 0.673 seconds
Cameron Riddell
  • This is brilliant! I wasn't familiar with using the axis argument in groupby. Do you know of other examples / docs? The pandas page for groupby provides no examples https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html – Pythonista anonymous Feb 23 '21 at 22:14
  • Unfortunately I don't know of anywhere with a lot of examples- try searching it on stackoverflow or coding creek. It's an application I've picked up from answering questions on here. If it helps to think about it this way, it's the functional equivalent of transposing your dataframe, grouping/aggregating, then transposing back. – Cameron Riddell Feb 23 '21 at 22:16
  • PS on my PC, numba becomes faster than your solution only for arrays > (300 x 400,000). Maybe I can build some logic to run numba only on massive arrays, and your code on all others – Pythonista anonymous Feb 23 '21 at 22:18
  • Ah good to know, I don't have a numba environment set up so I couldn't test that out. But that's pretty common when it comes to `DataFrames`. You could definitely write a function that uses groupby, and have your other function that uses numba. Then a third function that wraps both of them with an `if` statement that checks the `dataframe` size and calls the more appropriate function. – Cameron Riddell Feb 23 '21 at 22:23
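
To illustrate the "transpose, group, transpose back" equivalence mentioned a couple of comments above, here is a small sketch of my own (not code from the answer):

import numpy as np
import pandas as pd

categories = np.array(['red', 'green', 'red', 'green'])
df = pd.DataFrame(np.arange(8).reshape(2, 4))

# grouping across the columns...
by_columns = df.groupby(categories, axis=1).sum()
# ...matches transposing, grouping the rows, then transposing back
by_transpose = df.T.groupby(categories).sum().T

print(by_columns.equals(by_transpose))  # True

And a minimal sketch of the wrapper idea from the last comment; the threshold and the function names here are made up and would need tuning against real timings:

NUMBA_THRESHOLD = 400_000  # hypothetical cut-over point, based on my rough timings

def classify_pandas(my_array, categories):
    # small and medium arrays: groupby across the columns, as in the answer
    return pd.DataFrame(my_array).groupby(categories, axis=1).sum().to_numpy()

def classify(my_array, categories):
    # dispatch on the number of items (columns)
    if my_array.shape[1] > NUMBA_THRESHOLD:
        return numba_classify(my_array, categories)  # the numba function from the question
    return classify_pandas(my_array, categories)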