
Sorry for the cryptic description.....

I'm working in Python and need a fast solution to the problem below.

I have an array of float values (this array can contain millions of values):

values = [0.1, 0.2, 5.7, 12.9, 3.5, 100.6]

Each value represents an estimate of a quantity at a particular location, where the location is identified by an ID. Multiple estimates per location are possible/common:

locations = [1, 5, 3, 1, 1, 3]

I need to average all of the values that share the same location ID.

I can use numpy.where to do this for one location value

average_value_at_location = np.average(values[np.where(locations == 1)])

And of course I could loop over all of the unique values in locations, but I'm looking for a fast (vectorized) way of doing this and can't figure out how to compose the numpy functions to do it without looping in Python.
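
For reference, the plain looping version I have in mind looks something like this (just a sketch, with the lists converted to numpy arrays so the comparisons work):

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

# Loop over the unique location IDs and average the matching values;
# this is exactly the Python-level loop I'd like to avoid for millions of values.
unique_locations = np.unique(locations)
averages = {loc: values[locations == loc].mean() for loc in unique_locations}
# {1: 5.5, 3: 53.15, 5: 0.2}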

I'm not tied to numpy for this solution.

Any help will be gratefully received.

Thanks,

Doug

scotsman60
  • It sounds like you want an average per group; here is a [link](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) that gives a couple of solutions for doing so in numpy – Ben.T Aug 25 '21 at 20:33
  • I think looping over locations in `np.unique` isn't going to be that much slower, especially if you have only a few locations. Most of your work is doing `np.mean` over millions of values, after all. If you really want to squeeze that last bit of performance out of your loop, you could try using Numba to compile the loop (see the sketch after these comments). You could also use Numba to avoid making copies of arrays, but that would require you to implement your own `mean` algorithm, which is actually kinda complicated (there are many algorithms with different tradeoffs, and you'd probably want to match `np.mean`). – BatWannaBe Aug 25 '21 at 20:44
  • @Ben.T - Thanks for the links. One of them points to a pandas groupby solution which I had also come across since posting. It looks pretty good - it averaged 100,000 values across 25,000 locations in less than 0.01 seconds on my laptop. I'm currently looking at the numpy_indexed package which one of the links recommends - for some reason I'm motivated to find a numpy solution... – scotsman60 Aug 25 '21 at 23:37
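
A rough sketch of the Numba approach BatWannaBe mentions (hypothetical code; it accumulates a simple running sum and count per location instead of reproducing np.mean's algorithm exactly):

import numpy as np
from numba import njit

@njit
def numba_group_means(values, locations, max_id):
    # One pass over the data: accumulate a sum and a count per location ID.
    sums = np.zeros(max_id + 1)
    counts = np.zeros(max_id + 1)
    for k in range(values.shape[0]):
        loc = locations[k]
        sums[loc] += values[k]
        counts[loc] += 1
    # Divide only where a location actually occurred; the rest stay NaN.
    out = np.full(max_id + 1, np.nan)
    for loc in range(max_id + 1):
        if counts[loc] > 0:
            out[loc] = sums[loc] / counts[loc]
    return out

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])
numba_group_means(values, locations, locations.max())
# array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])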

3 Answers


Assuming locations go from 0 to a maximum value of locmax (e.g. locmax=5), you can create a 2d array of nans to store the values at the corresponding location:

placements = np.zeros((values.size, locmax+1)) * np.nan

Then assign all the values using indexing:

placements[np.arange(values.size), locations] = values

Finally, calculate the np.nanmean along axis 0:

means = np.nanmean(placements, axis=0)

For your example this results in:

array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])
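
Putting the steps together as one runnable snippet (a sketch: locmax is taken from the data here, and np.full is used as a slightly tidier way to build the NaN array):

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])
locmax = locations.max()

# One row per value, one column per location ID; NaN everywhere except the
# single cell (row, location) that holds each value.
placements = np.full((values.size, locmax + 1), np.nan)
placements[np.arange(values.size), locations] = values

# nanmean ignores the NaN padding, so each column averages only its own values.
means = np.nanmean(placements, axis=0)
# array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])

Note that this builds an array of shape (n, locmax + 1), so memory can become a concern when both the number of values and the number of distinct locations are large.
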
Jan Christoph Terasa

Using add.reduceat for every group.

Preparing the arrays

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

Getting the indices to sort the arrays in groups

locsort = np.argsort(locations)
# locations[locsort] -> [  1,    1,   1,   3,     3,   5]
# values[locsort]    -> [0.1, 12.9, 3.5, 5.7, 100.6, 0.2]

Computing the start index for each group

i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
# [0, 3, 5]

Adding values per group and dividing by the group size

np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
# [ 16.5, 106.3,   0.2] / [3, 2, 1]            

Output

array([ 5.5 , 53.15,  0.2 ])
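
Wrapped up as a function (a sketch of the steps above; the function name is mine, and the sorted location IDs are returned alongside the means so you know which mean belongs to which location):

import numpy as np

def reduceat_group_means(values, locations):
    locsort = np.argsort(locations)
    sorted_locs = locations[locsort]
    # Start index of each run of equal location IDs in the sorted order.
    # prepend=0 assumes the smallest ID is not 0 (true for the example data).
    i = np.flatnonzero(np.diff(sorted_locs, prepend=0))
    sums = np.add.reduceat(values[locsort], i)
    counts = np.diff(i, append=len(locsort))
    return sorted_locs[i], sums / counts

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])
ids, means = reduceat_group_means(values, locations)
# ids   -> array([1, 3, 5])
# means -> array([ 5.5 , 53.15,  0.2 ])
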
Michael Szczesny

OK - I've tried four solutions based on the replies here. So far, the pandas groupby approach is the winner, but the numpy add.reduceat solution proposed by Michael S is a close second.

Using pandas (from the link provided by Ben T)

# Imports (assuming timer is timeit's default_timer)
from timeit import default_timer as timer

import numpy as np
import pandas as pd

# Set up the data arrays
rng = np.random.default_rng(12345)
values = rng.random(size = 100000)
locations = rng.integers(low = 1, high = 25000, size = 100000)

#Create the pandas dataframe
df = pd.DataFrame({"locations":locations, "values": values})

# groupby and mean
start=timer()
average_by_location_pandas = df.groupby(["locations"]).mean()
end=timer()

print("Pandas time :", end-start)

Pandas time : 0.009602722000000008

Using numpy boolean indexing and a list comprehension to loop over unique locations

unique_locations = np.unique(locations)
average_by_location_numpy = [(i, values[locations==i].mean()) for i in unique_locations]

Numpy time : 2.644003632

Using numpy_indexed (link provided by Ben.T)

import numpy_indexed as npi
average_by_location_numpy_indexed = npi.group_by(locations).mean(values)

Numpy_indexed time : 0.03701074199999965

Using numpy add.reduceat (solution proposed by Michael S)

locsort = np.argsort(locations)
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
out = np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))

Numpy add_reduceat time : 0.01057279099999997
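
For reference, the same start/end pattern as in the pandas snippet can be used to time this version (a sketch, assuming the same imports and data as above, with timer being timeit's default_timer):

start = timer()
locsort = np.argsort(locations)
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
out = np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
end = timer()

print("Numpy add_reduceat time :", end - start)
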
scotsman60
  • I'm surprised that the pandas solution wins against the different numpy solutions. I guess the number of different locations is the reason. – Ben.T Aug 26 '21 at 12:17