Consider the code below, which performs an operation in non-vectorized form:

import numpy as np

N = 110
observables = np.random.rand(N)

I = np.random.randint(np.array([0,0])[:,None], np.array([4,5])[:,None], 
                     size=(2,N))

# you can think of this as two groupings of the data, with the numbers
# indicating separate groups (one grouping has 4 groups, the other 5).
# I want to compute the average for each pair (i, j), where i is the
# group index from the first grouping and j from the second

averages = np.zeros((4,5))

# unvectorized
for i in range(4):
    for j in range(5):
        J = np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten()
        averages[i,j] = np.mean(observables[J])

I came up with the following vectorization, but it is very inefficient for large N: the least common multiple of the group sizes grows to intractable values even for something like N = 1000.

import math

#vectorized inefficiently
J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]
lengths = [len(j) for j in J]

L = math.lcm(*lengths)

J = np.array([np.tile(j,int(L/len(j))) for j in J])

averages_vectorized = np.reshape(np.mean(observables[J], axis=1), (4,5))

Is there any other way to vectorize this? For instance, can a list of indices like [0, 1, 2] be extended to something like [0, 1, 2, sth, sth] such that, when I use it to index a NumPy vector, sth is not taken into account?

PS: There is also the following way, which just hides the for loops in list comprehensions:

J = [np.argwhere((I[0,:]==i) & (I[1,:]==j)).flatten() for i in range(4)
     for j in range(5)]

averages_list_comprehension = np.reshape(np.array([np.mean(observables[j]) for j in J]), (4,5))

PS2: One way to do it is to append a nan value to the end of observables and then pad every element of J with the index N until all are the same size, and use np.nanmean. However, my end goal is to apply this in the context of tensors from PyTensor, so I am not sure how one would do the same there (in that case observables would be scalar tensors, and I don't know if there is a nan tensor and a nanmean in PyTensor).
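For reference, a minimal NumPy sketch of this nan-padding idea (the PyTensor equivalent is the open question; the variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 110
observables = rng.random(N)
I = rng.integers(0, [[4], [5]], size=(2, N))

# group the flat indices by (i, j) combination
J = [np.flatnonzero((I[0] == i) & (I[1] == j))
     for i in range(4) for j in range(5)]

# pad every index list with the sentinel index N to a common length
max_len = max(len(j) for j in J)
padded = np.full((len(J), max_len), N)
for row, j in zip(padded, J):
    row[:len(j)] = j

# append one nan so that index N selects it, then ignore it with nanmean
obs_ext = np.append(observables, np.nan)
averages = np.nanmean(obs_ext[padded], axis=1).reshape(4, 5)
```

Unlike the lcm-based tiling, the padded array is only as wide as the largest group, so it stays tractable for large N.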


1 Answer

Update: represent the data as an (n+1)-dimensional array

If the data and the number of groups are not big, we can represent them as a 3D array in the case of two groupings, where the first axis is the index of the values and the last two axes correspond to the group numbers of each grouping feature. For example, if the first value x0 belongs to groups 1 and 2, then arr[0, 1, 2] is equal to x0, while all other entries of arr[0, ...] are nan. If the next value x1 belongs to groups 2 and 0, then arr[1, 2, 0] is equal to x1, while the rest of arr[1, ...] is nan, and so on.

import numpy as np
from numpy.random import default_rng
rng = default_rng(0)

# parameters
N = 110
group_counts = (4, 5)

# init
values = rng.random(N)
groups = rng.integers(0, group_counts, size=(N, len(group_counts)))

# transform values into (n+1)-dimensional array, where n is number of grouping features
arr = np.full((N, *group_counts), np.nan, dtype=float)
arr[range(N), *groups.T] = values

# calculate mean values along axis 0 ignoring nan values
answer = np.nanmean(arr, axis=0)

For big data, we can use sparse arrays instead:

from scipy import sparse

... # initialize values and groups as before

# mark each combination of grouping features by unique ID numbers
features = np.arange(np.prod(group_counts)).reshape(group_counts)

# create a compressed sparse column matrix:
# the first index enumerates the values,
# the second is the ID of the feature combination
# for the corresponding value
arr = sparse.csc_array(
    (values, (np.arange(N), features[*groups.T])),
    shape=(N, np.prod(group_counts))
)

But in this case we have to be careful: empty cells are treated as 0 and are taken into account when computing statistics:

answer1 = arr.mean(axis=0).reshape(group_counts)
assert not np.allclose(answer, answer1)    # we obtain a different answer in this case

To overcome this issue, we have to re-implement the functions of interest. In the case of mean, we can try this:

value_counts = np.diff(arr.indptr)
answer2 = (arr.sum(0) / value_counts).reshape(group_counts)
assert np.allclose(answer, answer2)

Here np.diff(arr.indptr) is equal to [x.nnz for x in arr.T] and returns the number of values stored in each column (note that zeros which are explicitly stored in the array are counted too). See the documentation for details about nnz and indptr.
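A tiny illustration of this relationship, on a made-up matrix:

```python
import numpy as np
from scipy import sparse

m = sparse.csc_array(np.array([[1., 0., 2.],
                               [0., 0., 3.],
                               [4., 0., 5.]]))

# indptr holds the cumulative count of stored values, column by column
print(m.indptr)           # [0 2 2 5]
# consecutive differences give the number of stored values per column
print(np.diff(m.indptr))  # [2 0 3]
```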

Alternatively, we can try n-dimensional sparse arrays (the sparse package):

import sparse

arr = sparse.COO([range(N), *groups.T], values,
                 shape=(N, *group_counts), fill_value=np.nan)
answer = sparse.nanmean(arr, axis=0).todense()

Previous answer: Split sorted group combinations

Sort group combinations and split:

def mean_by_groups(values, groups):
    # ensure it can work with structures other than numpy.ndarray
    values = np.asarray(values)
    groups = np.atleast_2d(np.asarray(groups))
    
    # ensure we have group mappings along the last value axis 
    assert values.shape[-1] == groups.shape[-1], 'Inappropriate shape of group mappings'
    assert (groups >= 0).all(), 'group numbering should start from 0'
    
    length = values.shape[-1]
    groups_count = 1 + groups.max(-1)
    # by default, lexical sorting is performed on the last axis
    sorted_by_group = np.lexsort(groups)
    # determine where combinations of group mappings are changing
    # here prepend=-1 to include the first combination (group number != -1)
    diff_from_prev = np.diff(groups[:, sorted_by_group], prepend=-1).any(0)
    bins = np.arange(length)[diff_from_prev]
    coordinates = np.transpose(groups[:, sorted_by_group][:, bins])
    # here we use bins[1:] to avoid splitting in front of the first value
    value_groups = np.split(values[sorted_by_group], bins[1:])
    ans = np.full(shape=groups_count, fill_value=np.nan)
    for group_combination, arr in zip(coordinates, value_groups):
        # here we use tuple to address a single cell
        ans[tuple(group_combination)] = arr.mean()
        
    return ans


answer = mean_by_groups(observables, I)

This way, we can have more than two group mappings. Also, consider numpy.full instead of numpy.zeros: there may be group combinations which are not present in the mapping and should stay nan instead of 0.

• This is not really vectorizing in the way I am asking for: it is still iterating over a for loop of the same length as the two loops of size 4 and 5. I am imagining a situation in which one can say ans = mean(observables[I], axis=1). See, for instance, the second PS. – Sina Aug 31 '23 at 16:30