Mean value calculation for clustered data using NumPy

Question

Suppose I have two arrays:

x, which contains m points;
c, which contains m cluster ids, one for each corresponding point in x.

I want to calculate the mean of the points that share the same id, i.e. that belong to the same cluster. I know that c contains integers in the range [0, k) and that every value in that range appears in c. My current solution looks like this:

import numpy as np

np.random.seed(42)

k = 3
x = np.random.rand(100, 2)
c = np.random.randint(0, k, size=x.shape[0])
mu = np.zeros((k, 2))

for i in range(k):
    mu[i] = x[c == i].mean(axis=0)

While this approach works, I'm wondering if there is a more efficient way to calculate the means in NumPy without having to use an explicit for loop?

Related: https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function You might also consider using a pandas groupby, if you're willing to step outside of numpy. — Chrysophylaxs, Mar 01 '23 at 18:16

score 2 · Answer 1 · answered Mar 01 '23 at 19:34


You can do that relatively efficiently using:

# Number of different IDs
size = np.max(c) + 1
xBins = np.zeros(size)
yBins = np.zeros(size)

# Sum by IDs
np.add.at(xBins, c, x[:,0])
np.add.at(yBins, c, x[:,1])

# Compute the number of items per ID
counts = np.bincount(c)

# Compute the mean
xBins /= counts
yBins /= counts

# Build the final array
mu = np.vstack([xBins, yBins]).T

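Also note that storing points in a (2, N) array is generally more efficient than storing them in an (N, 2) array: with NumPy's default row-major layout, each coordinate then sits in one contiguous block of memory.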
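To illustrate the layout remark, here is a minimal sketch that assumes the points are stored as a (2, N) array and sums each coordinate row with `np.bincount` and its `weights` argument instead of `np.add.at`. This is a swapped-in alternative, not the code from the answer above; the names `pts`, `sums` and `rng` are made up for the example, and the data is generated so that every cluster id appears, as the question assumes.

```python
import numpy as np

rng = np.random.default_rng(42)

k = 3
# Assumed (2, N) layout: row 0 holds all x coordinates, row 1 all y coordinates
pts = rng.random((2, 100))
c = rng.integers(0, k, size=pts.shape[1])
c[:k] = np.arange(k)  # make sure every id in [0, k) is present (assumption from the question)

# Number of points per cluster
counts = np.bincount(c, minlength=k)

# Per-cluster sums, one bincount call per coordinate row (each row is contiguous)
sums = np.stack([np.bincount(c, weights=row, minlength=k) for row in pts])

# Means with shape (k, 2), matching the layout of mu in the question
mu = (sums / counts).T
```

The comprehension loops over the coordinates (two here), not over the clusters, so its cost does not grow with k.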


Jérôme Richard


I implemented the example given in the question, this method, and the most efficient version from the link in @Chrysophylaxs's comment; Jérôme's is the fastest. andywiecko: 0.37ms / 10k iter; Chrysophylaxs (et al): 0.58; Jérôme: 0.34. – Saul Aryeh Kohn Mar 01 '23 at 20:02

You don't need to separate the sums into `xBins, yBins`. Simply define `bins = np.zeros((size, x.shape[1]))`; then `np.add.at(bins, c, x)` gives the sum per bin. This works well when `x` has many columns, where it wouldn't be practical to keep each column in a separate variable. – Pranav Hosangadi Mar 01 '23 at 20:20
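A minimal sketch of the single-array variant described in this comment, reusing the setup from the question. The final division by `counts[:, None]` is an assumption added here to turn the per-bin sums into means; the comment itself only covers the summation step.

```python
import numpy as np

np.random.seed(42)

k = 3
x = np.random.rand(100, 2)
c = np.random.randint(0, k, size=x.shape[0])

# One row of accumulators per cluster id, one column per coordinate
size = np.max(c) + 1
bins = np.zeros((size, x.shape[1]))

# Unbuffered add: row x[j] is added to bins[c[j]], even when ids repeat
np.add.at(bins, c, x)

# Number of points per cluster, then broadcast the division over the columns
counts = np.bincount(c, minlength=size)
mu = bins / counts[:, None]
```

The same code works unchanged for any number of columns in `x`, since nothing refers to the individual coordinates.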
