10

Let's assume I have a large 2D numpy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L and a 1D float array of the same length. If I want to simply assign the floats to different positions in the original array according to the integer arrays, I could write:

import numpy as np

mat = np.zeros((1000,1000))
int1 = np.random.randint(0,999,size=(50000,))
int2 = np.random.randint(0,999,size=(50000,))
f = np.random.rand(50000)
mat[int1,int2] = f

But if there are collisions, i.e. multiple floats mapping to a single location, all but the last one are overwritten. Is there a way to aggregate all the collisions, e.g. take the mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and hopefully avoid interpreter loops.
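For instance, with duplicate indices only the last write survives (a tiny illustration):

import numpy as np

m = np.zeros((2, 2))
m[[0, 0], [0, 0]] = [1.0, 2.0]   # both writes target m[0, 0]
print(m[0, 0])                   # 2.0 -- only the last value is kept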

Thanks!

Cindy Almighty
  • Consider the `ufunc` `.at` method, e.g. [np.add.at indexing with array](https://stackoverflow.com/questions/45473896/np-add-at-indexing-with-array) – hpaulj Jun 29 '18 at 00:41
  • If you want the mean and there is no maximum number of times the entry could be updated, you'll need a 3D array to store all the values and then take the mean at the end. – Kyle Jun 29 '18 at 00:44

3 Answers

6

Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:

import numpy as np

mat = np.zeros((2,2))
int1 = np.zeros(2, dtype=int)   # both values target row 0
int2 = np.zeros(2, dtype=int)   # ... and column 0, so they collide at mat[0, 0]
f = np.array([0, 1])

np.add.at(mat, (int1, int2), f)   # unbuffered add: sums all colliding values
n = np.zeros((2,2))
np.add.at(n, (int1, int2), 1)     # count how many values landed in each cell
mat[int1, int2] /= n[int1, int2]  # sums / counts = mean
print(mat)

[[0.5 0. ]
 [0.  0. ]]
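Scaled up to the sizes in the question, a minimal sketch might look like this (using `np.divide` with `where=` so cells that were never hit stay at 0):

import numpy as np

mat = np.zeros((1000, 1000))
counts = np.zeros((1000, 1000))
int1 = np.random.randint(0, 999, size=(50000,))
int2 = np.random.randint(0, 999, size=(50000,))
f = np.random.rand(50000)

np.add.at(mat, (int1, int2), f)     # per-cell sums
np.add.at(counts, (int1, int2), 1)  # per-cell counts
np.divide(mat, counts, out=mat, where=counts > 0)  # per-cell means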
Julien
  • Very clever and efficient! There is no version for median, though? – Cindy Almighty Jun 29 '18 at 03:17
  • I had a small play with median too but couldn't think of an easy way to get it. (Doesn't mean it doesn't exist :). The main reason is you need to keep a list of all collisions to compute a median, which forces you (I believe) to use Python lists, which don't integrate well with numpy vectorization... (but see the sketch below) – Julien Jun 29 '18 at 03:35
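For what it's worth, a mostly pure-numpy median is possible: flatten the coordinates into a single integer key, sort once with `np.lexsort`, then pick the middle element(s) of each run. This is only a sketch (variable names follow the question), not a drop-in answer:

import numpy as np

flat = int1 * mat.shape[1] + int2             # one integer key per (row, col)
order = np.lexsort((f, flat))                 # sort by cell, then by value within a cell
flat_s, f_s = flat[order], f[order]

starts = np.flatnonzero(np.r_[True, flat_s[1:] != flat_s[:-1]])  # start of each cell's run
counts = np.diff(np.r_[starts, flat_s.size])                     # run lengths

lo = f_s[starts + (counts - 1) // 2]          # lower middle element of each run
hi = f_s[starts + counts // 2]                # upper middle element
mat.flat[flat_s[starts]] = 0.5 * (lo + hi)    # median per cell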
5

You can manipulate your data in pandas and then assign.

Starting from

mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)

You can define a function

import pandas as pd

def get_aggregated_collisions(a, b, c):
    df = pd.DataFrame({'x': a, 'y': b, 'v': c})
    df['coord'] = df[['x', 'y']].apply(tuple, axis=1)   # hashable key per coordinate pair
    d = df.groupby('coord').agg({'v': 'mean', 'x': 'first', 'y': 'first'}).to_dict('list')
    return d

and then

d = get_aggregated_collisions(a,b,c)
mat[d['x'], d['y']] = d['v']

The whole operation (including generating the matrix and the random arrays) runs reasonably fast:

1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The idea behind making a tuple of coordinates was to have a hashable key to group values by their coordinates (see also the sketch below). Maybe there is an even smarter way to do this :) always open to suggestions.
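One possible alternative sketch: encode each coordinate pair as a single integer key instead of a tuple and group on that, which usually groups faster (the function name and the `ncols` parameter here are just illustrative):

import pandas as pd

def get_aggregated_collisions_flat(a, b, c, ncols=1000):
    # (x, y) -> x * ncols + y gives one scalar key per cell
    df = pd.DataFrame({'key': a * ncols + b, 'v': c})
    agg = df.groupby('key')['v'].mean()
    keys = agg.index.values
    return keys // ncols, keys % ncols, agg.values

x, y, v = get_aggregated_collisions_flat(a, b, c)
mat[x, y] = v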

rafaelc
  • You don't need to create `tuple`. Just do the grouping based on both `x` and `y` columns. – Tai Jun 29 '18 at 03:40
  • @CindyAlmighty Yes, it does! :) Just change `"mean"` to `"median"` or whatever operation you might want. – rafaelc Jun 29 '18 at 12:10
  • @Tai hadn't slept in the past 2 days haha, thanks for pointing that out. I won't edit so as not to make your answer redundant. Thanks ;} – rafaelc Jun 29 '18 at 12:11
  • @RafaelC sounds like you've had some rough days. Feel free to edit your answer to provide users with better information. I would not mind. – Tai Jun 29 '18 at 12:22
3

My attempt, based on RafaelC's answer.

First group by ["x", "y"], then take the mean or median of each group, and finally restore the coordinate columns with reset_index().

import numpy as np
import pandas as pd

# setup
mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)

# start here
df = pd.DataFrame({"x": a, "y": b, "val": c})
v = df.groupby(["x", "y"]).mean().reset_index()   # one row per (x, y) with the mean of its values
mat[v["x"], v["y"]] += v["val"]

If medians are needed, modify v to be

v = df.groupby(["x", "y"]).median().reset_index()
Tai