NumPy apply function to groups of rows corresponding to another numpy array

Question

I have a NumPy array with each row representing some (x, y, z) coordinate like so:

a = array([[0, 0, 1],
           [1, 1, 2],
           [4, 5, 1],
           [4, 5, 2]])

I also have another NumPy array with unique values of the z-coordinates of that array like so:

b = array([1, 2])

How can I apply a function, let's call it "f", to each of the groups of rows in a which correspond to the values in b? For example, the first value of b is 1 so I would get all rows of a which have a 1 in the z-coordinate. Then, I apply a function to all those values.

In the end, the output would be an array the same shape as b.

I'm trying to vectorize this to make it as fast as possible. Thanks!

Example of an expected output (assuming that f is count()):

c = array([2, 2])

because 2 rows of a have a z value of 1 (the first value of b) and 2 rows of a have a z value of 2 (the second value of b).

A trivial solution would be to iterate over array b like so:

c = []
for val in b:
    rows = a[a[:, 2] == val]  # rows of a whose z-coordinate equals val
    c.append(f(rows))         # apply the function to that group
c = np.array(c)

My attempt:

I tried doing something like this, but it just returns an empty array.

func(a[a[:, 2]==b])
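
One likely reason the attempt fails, as a hedged sketch: a[:, 2] has shape (4,) while b has shape (2,), so the two cannot be compared element by element. Adding an axis lets NumPy broadcast them into a (4, 2) boolean mask with one column per value of b. The names mask and counts are illustrative and not from the question, and the last step only covers the special case where f is a count.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
b = np.array([1, 2])

# Shapes (4, 1) and (2,) broadcast to (4, 2):
# mask[i, j] is True when row i of a has z == b[j].
mask = a[:, 2][:, None] == b

# Special case f = count: summing each column counts the rows per group.
counts = mask.sum(axis=0)
print(counts)  # [2 2]
```

Each column mask[:, j] can also be used directly as a row selector, a[mask[:, j]], which is what the loop over b does one value at a time.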

Is the expected output just the count or all of those indices? — Divakar, Feb 28 '20 at 08:13
@Divakar The expected output in the example is the count based on those indices, but "count" in my case can be any function — cmed123, Feb 28 '20 at 08:14

Andreas K. · Accepted Answer · 2020-02-28T10:07:51.480

The problem is that the groups of rows with the same Z can have different sizes, so you cannot stack them into one 3D NumPy array, which would make it easy to apply a function along the third dimension. One solution is to use a for-loop; another is to use np.split:

import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])


a_sorted = a[a[:,2].argsort()]

inds = np.unique(a_sorted[:,2], return_index=True)[1]

a_split = np.split(a_sorted, inds)[1:]

# [array([[0, 0, 1],
#         [4, 5, 1],
#         [4, 3, 1]]),
#  array([[1, 1, 2],
#         [4, 5, 2]])]

f = np.sum  # example of a function

result = list(map(f, a_split))
# [19, 15]
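
When f is a reduction backed by a NumPy ufunc (sum, min, max and similar), the final map over the split pieces can be replaced by the ufunc's reduceat method on the sorted array. This is a minimal sketch, assuming f = np.sum as in the example above; kind="stable" and the names starts and row_sums are choices made for this illustration, not part of the answer.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])

# Sort rows by z so that each group is one contiguous block.
order = np.argsort(a[:, 2], kind="stable")
a_sorted = a[order]

# starts[i] is the index of the first row of the i-th group.
z_values, starts = np.unique(a_sorted[:, 2], return_index=True)

# Sum each row first, then sum the contiguous blocks between the start indices.
row_sums = a_sorted.sum(axis=1)
group_sums = np.add.reduceat(row_sums, starts)
print(group_sums)  # [19 15]
```

This only helps for functions that can be written as such a reduction; an arbitrary Python callable still has to be applied group by group.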

~~But imho the best solution is to use pandas and groupby as suggested by FBruzzesi. You can then convert the result to a numpy array.~~

EDIT: For completeness, here are the other two solutions

List comprehension:

b = np.unique(a[:,2])
result = [f(a[a[:,2] == z]) for z in b]

Pandas:

import pandas as pd

df = pd.DataFrame(a, columns=list('XYZ'))
result = df.groupby(['Z']).apply(lambda x: f(x.values)).tolist()

This is the performance plot I got for a = np.random.randint(0, 100, (n, 3)):

[performance plot]

In the plot, the "split solution" is the fastest up to approximately n = 10^5, but after that the pandas solution performs better.
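
A comparison of this kind could be rerun with a small timing harness. The sketch below is an assumption about how one might measure it, not the script behind the figures quoted above: it covers only the two NumPy variants (the pandas variant is left out to keep it dependency-free), and the function names, the seed, n and the repeat count are all chosen here for illustration.

```python
import timeit
import numpy as np

def split_solution(a, f):
    # Sort by z, find where each group starts, split, apply f to each piece.
    a_sorted = a[a[:, 2].argsort()]
    inds = np.unique(a_sorted[:, 2], return_index=True)[1]
    return list(map(f, np.split(a_sorted, inds)[1:]))

def comprehension_solution(a, f):
    # One boolean mask per unique z value.
    z = a[:, 2]
    return [f(a[z == v]) for v in np.unique(z)]

np.random.seed(0)          # assumed seed, for repeatability only
n = 10_000                 # assumed size; vary this to see the trend
a = np.random.randint(0, 100, (n, 3))
f = np.sum

# Both variants visit the groups in sorted z order, so the results must agree.
assert split_solution(a, f) == comprehension_solution(a, f)

repeats = 20
for name, fn in [("split", split_solution),
                 ("comprehension", comprehension_solution)]:
    seconds = timeit.timeit(lambda: fn(a, f), number=repeats)
    print(f"{name}: {seconds / repeats * 1e3:.3f} ms per call")
```

Timings depend on the machine, the NumPy version and the number of distinct z values, so the crossover point may differ from the one reported here.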

Thanks for this! I'm wondering why you say it's better to use pandas though? Isn't pandas slower than numpy? Or am I wrong? Also not sure if it's good practice to do something in pandas and convert back to numpy — cmed123, Feb 28 '20 at 08:31
I don't know if the pandas solution is better in terms of speed, but it is more clean (just two lines: convert to df, groupby + apply your function). — Andreas K., Feb 28 '20 at 08:33

FBruzzesi · Answer 2 · 2020-02-28T08:10:41.410

If you are allowed to use pandas:

import pandas as pd
df=pd.DataFrame(a, columns=['x','y','z'])

df.groupby('z').agg(f)

Here f can be any custom function working on grouped data.

Numeric example:

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
df=pd.DataFrame(a, columns=['x','y','z'])
df.groupby('z').size()

z
1    2
2    2
dtype: int64

Note that .size() is the way to count the number of rows per group.

To keep it in pure NumPy, maybe this can suit your case:

tmp = np.array([a[a[:,2]==i] for i in b])
tmp 
array([[[0, 0, 1],
        [4, 5, 1]],

       [[1, 1, 2],
        [4, 5, 2]]])

which is an array holding each group of rows. It stacks into a regular 3D array only because both groups here have the same number of rows.
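
If the groups have different sizes, wrapping the list in np.array no longer gives a regular 3D array (depending on the NumPy version it raises an error or yields an object array). A hedged sketch of one way around this is to keep the groups in a plain list and convert only the per-group results. The fifth row and the choice f = len are assumptions made for the example.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])  # z == 1 now has three rows, z == 2 has two
b = np.unique(a[:, 2])

# Keep the unequal groups in a list instead of stacking them.
groups = [a[a[:, 2] == z] for z in b]
print([g.shape for g in groups])  # [(3, 3), (2, 3)]

# Apply f per group; the results are scalars, so they stack cleanly.
f = len
result = np.array([f(g) for g in groups])
print(result)  # [3 2]
```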

edited Feb 28 '20 at 08:10

answered Feb 28 '20 at 08:02

FBruzzesi

Thanks for your comment! I would prefer to just use NumPy though. Is there a way to convert that to NumPy style though? – cmed123 Feb 28 '20 at 08:03
The idea of `numpy` solution is right, however, list comprehensions should be avoided in `numpy`. – mathfux Feb 28 '20 at 08:13
Would the list comprehension still evaluate the elements one-by-one as opposed to being vectorized? – cmed123 Feb 28 '20 at 08:15
@cmed123 Let me think a little bit, seems like I didn't look deep enough to this problem. Groups are not required to have the same length, isn't it? – mathfux Feb 28 '20 at 08:20
List comprehensions are not avoidable in that case. – mathfux Feb 28 '20 at 08:21
@mathfux Yup groups are not required to have the same length. Hmm you mean there's no way to do it without some for loop? – cmed123 Feb 28 '20 at 08:22
Yes, exactly. If the arrays have different lengths, they can't share the same vectorized action and one needs to use iteration. The only approach I see is using mask arrays for each group, but I have doubts about the efficiency in this case. – mathfux Feb 28 '20 at 08:26
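
The mask idea from the last comment can be sketched for the case where f is a sum. This is one possible reading of that suggestion, not code from the thread: it builds a boolean mask with one column per value of b and zeroes out the rows that do not belong to each group. The intermediate array has n × len(b) entries, which is the efficiency concern raised above.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2],
              [4, 3, 1]])
b = np.array([1, 2])

# mask[i, j] is True when row i belongs to the group b[j]; shape (5, 2).
mask = a[:, 2][:, None] == b

# Per-row sums, broadcast against the mask; rows outside a group contribute 0.
row_sums = a.sum(axis=1)
group_sums = np.where(mask, row_sums[:, None], 0).sum(axis=0)
print(group_sums)  # [19 15]
```

Functions that are not simple reductions do not fit this pattern and still need a loop over the groups.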

score 1 · Answer 3 · answered Feb 28 '20 at 08:12

c = np.array([])
for x in np.nditer(b):
    c = np.append(c, np.where((a[:,2] == x))[0].shape[0])

Output:

[2. 2.]
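
For the counting case specifically, np.unique with return_counts=True gives the same numbers without a Python loop and keeps them as integers. A minimal sketch, assuming every value of b occurs in the z column as the question states; the names z_values and z_counts are illustrative.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 1, 2],
              [4, 5, 1],
              [4, 5, 2]])
b = np.array([1, 2])

# Sorted unique z values and how many rows carry each of them.
z_values, z_counts = np.unique(a[:, 2], return_counts=True)

# z_values is sorted, so searchsorted locates each entry of b in it.
# Assumption: every value in b is present in a[:, 2].
c = z_counts[np.searchsorted(z_values, b)]
print(c)  # [2 2]
```

The lookup step only matters when b is ordered differently from the sorted unique values or covers a subset of them.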

Zaraki Kenpachi

Thanks for the suggestion! Doesn't the for loop still do it one-by-one though and not vectorize it? I'm not too familiar with np.nditer – cmed123 Feb 28 '20 at 08:14
