
I have the following example array of x-y coordinate pairs:

A = np.array([[0.33703753, 3.],
              [0.90115394, 5.],
              [0.91172016, 5.],
              [0.93230994, 3.],
              [0.08084283, 3.],
              [0.71531777, 2.],
              [0.07880787, 3.],
              [0.03501083, 4.],
              [0.69253184, 4.],
              [0.62214452, 3.],
              [0.26953094, 1.],
              [0.4617873 , 3.],
              [0.6495549 , 0.],
              [0.84531478, 4.],
              [0.08493308, 5.]])

My goal is to reduce this to an array with six rows by taking the average of the x-values for each y-value, like so:

array([[0.6495549 , 0.        ],
       [0.26953094, 1.        ],
       [0.71531777, 2.        ],
       [0.41882167, 3.        ],
       [0.52428582, 4.        ],
       [0.63260239, 5.        ]])

Currently I am achieving this by converting to a pandas dataframe, performing the calculation, and converting back to a numpy array:

>>> df = pd.DataFrame({'x':A[:, 0], 'y':A[:, 1]})
>>> df.groupby('y').mean().reset_index()
     y         x
0  0.0  0.649555
1  1.0  0.269531
2  2.0  0.715318
3  3.0  0.418822
4  4.0  0.524286
5  5.0  0.632602

Is there a way to perform this calculation using numpy, without having to resort to the pandas library?

CDJB
  • Does this answer your question? [Is there any numpy group by function?](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) – Pranav Hosangadi Dec 19 '22 at 16:13
  • @PranavHosangadi unfortunately not, answers to that question produce lists of the x-coordinates but do not maintain the y-coordinates nor perform the mean computation. – CDJB Dec 19 '22 at 16:17
  • Clever np answers have been suggested. But is there any benefit in not using Pandas, which produces the np array with a readable one-liner? – user19077881 Dec 19 '22 at 16:40
  • @user19077881 If this is the only reason pandas is being imported, then a numpy-only answer can avoid the need for an extra library – Pranav Hosangadi Dec 19 '22 at 16:44
  • @user19077881 I added some timing comparisons between the pandas and numpy-only methods in [my answer below](https://stackoverflow.com/a/74853319/843953). Numpy-only wins handily, so there's another reason to use it instead of going through pandas. – Pranav Hosangadi Dec 19 '22 at 17:51

4 Answers


Here's a completely vectorized solution that only uses numpy methods and no python iteration:

sort_indices = np.argsort(A[:, 1])
unique_y, unique_indices, group_count = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)

Once we have the indices and counts of all the unique elements, we can use the np.ufunc.reduceat method to collect the results of np.add for each group, and then divide by their counts to get the mean:

group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)

group_mean = group_sum / group_count[:, None]
# array([[0.6495549 , 0.        ],
#        [0.26953094, 1.        ],
#        [0.71531777, 2.        ],
#        [0.41882167, 3.        ],
#        [0.52428582, 4.        ],
#        [0.63260239, 5.        ]])
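For reference, the steps above can be packaged into one helper (the function name `groupby_mean` is my own choice, not a numpy API, and the tiny example array is made up):

```python
import numpy as np

def groupby_mean(A):
    """Group rows of A by column 1 and return the per-group row means."""
    # Sort rows so that equal y values are contiguous
    A_sorted = A[np.argsort(A[:, 1]), :]
    # Start index and size of each group of equal y values
    _, unique_indices, group_count = np.unique(
        A_sorted[:, 1], return_index=True, return_counts=True
    )
    # Sum each group's rows in one vectorized call, then divide by group size
    group_sum = np.add.reduceat(A_sorted, unique_indices, axis=0)
    return group_sum / group_count[:, None]

print(groupby_mean(np.array([[0.5, 1.], [1.5, 1.], [2.0, 0.]])))
# [[2. 0.]
#  [1. 1.]]
```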

Benchmarks:

Comparing this solution with the other answers here (Code at tio.run) for

  1. A contains 10k rows, with A[:, 1] containing N groups, where N varies from 1 to 10k
     (plot: timing for different methods with 10k rows, N groups)

  2. A contains N rows (N varies from 1 to 10k), with A[:, 1] containing min(N, 1000) groups
     (plot: timing for different methods with N rows, 1k groups)

Observations: The numpy-only solutions (Dani's and mine) win easily; they are significantly faster than the pandas approach, likely because the time spent creating the dataframe is an overhead the numpy-only methods avoid.

The pandas solution is slower than the python+numpy solutions (Jamiu's and mine) for smaller arrays, since it's faster to just iterate in python than to create a dataframe first, but those solutions become much slower than pandas as the array size or the number of groups increases.


Note: The previous version of this answer iterated over the groups as returned by the accepted answer to Is there any numpy group by function? and individually calculated the mean:

First, we need to sort the array on the column you want to group by

A_s = A[A[:, 1].argsort(), :]

Then, apply the snippet from that answer: np.split splits its first argument at the indices given by the second argument.

unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True) 
# (array([0., 1., 2., 3., 4., 5.]), array([ 0,  1,  2,  3,  9, 12])) 

split_indices = unique_indices[1:] # No need to split at the first index

groups = np.split(A_s, split_indices)
# [array([[0.6495549, 0.       ]]),
#  array([[0.26953094, 1.        ]]),
#  array([[0.71531777, 2.        ]]),
#  array([[0.33703753, 3.        ],
#         [0.93230994, 3.        ],
#         [0.08084283, 3.        ],
#         [0.07880787, 3.        ],
#         [0.62214452, 3.        ],
#         [0.4617873 , 3.        ]]),
#  array([[0.03501083, 4.        ],
#         [0.69253184, 4.        ],
#         [0.84531478, 4.        ]]),
#  array([[0.90115394, 5.        ],
#         [0.91172016, 5.        ],
#         [0.08493308, 5.        ]])]

Now, groups is a list of np.arrays. Iterate over the list and take the mean of each array:

means = np.zeros((len(groups), groups[0].shape[1]))
for i, grp in enumerate(groups):
    means[i, :] = grp.mean(axis=0)

# array([[0.6495549 , 0.        ],
#        [0.26953094, 1.        ],
#        [0.71531777, 2.        ],
#        [0.41882167, 3.        ],
#        [0.52428582, 4.        ],
#        [0.63260239, 5.        ]])
Pranav Hosangadi

Use np.bincount + np.unique:

sums = np.bincount(A[:, 1].astype(np.int64), weights=A[:, 0])
values, counts = np.unique(A[:, 1], return_counts=True)
res = np.vstack((sums / counts, values)).T
print(res)

Output

[[0.6495549  0.        ]
 [0.26953094 1.        ]
 [0.71531777 2.        ]
 [0.41882167 3.        ]
 [0.52428582 4.        ]
 [0.63260239 5.        ]]
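One caveat worth noting (my addition, not part of the answer above): np.bincount returns one slot for every integer in [0, max(y)], so if some integer y values never occur, sums and counts end up with different lengths. A small sketch that stays aligned by indexing bincount's result with the y values that are actually present:

```python
import numpy as np

A = np.array([[0.2, 0.], [0.4, 0.], [0.9, 3.]])  # y == 1 and y == 2 never occur

sums = np.bincount(A[:, 1].astype(np.int64), weights=A[:, 0])
values, counts = np.unique(A[:, 1], return_counts=True)
# sums has 4 entries (for y = 0, 1, 2, 3); keep only the y values present
res = np.vstack((sums[values.astype(np.int64)] / counts, values)).T
print(res)
# [[0.3 0. ]
#  [0.9 3. ]]
```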
Dani Mesejo

Here is a workaround using numpy.

unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.empty((unique_ys.shape[0], 2))

for i, y in enumerate(unique_ys):
    result[i, 0] = np.mean(A[indices == i, 0])
    result[i, 1] = y

print(result)

Alternative:
To make the code more Pythonic, you can build the result array with a list comprehension instead of a for loop.

unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.array([[np.mean(A[indices == i, 0]), y] for i, y in enumerate(unique_ys)])

print(result)

Output:

[[0.6495549  0.        ]
 [0.26953094 1.        ]
 [0.71531777 2.        ]
 [0.41882167 3.        ]
 [0.52428582 4.        ]
 [0.63260239 5.        ]]
Jamiu S.

If you know the y values beforehand, you could try to match the array for each:

for example:

A[(A[:,1]==1),0] will give you all the x values where the y value is equal to 1.

So you could go through each value of y, sum the boolean mask A[:,1]==y to count the matches, sum the matching x values, divide to get the average, and place the result in a new array:

B = np.zeros([6, 2])

for i in range(6):
    nmatch = np.sum(A[:, 1] == i)        # number of rows with y == i
    nsum = np.sum(A[(A[:, 1] == i), 0])  # sum of their x values

    B[i, 0] = nsum / nmatch              # mean x for this group
    B[i, 1] = i                          # the y value itself

There must be a more pythonic way of doing this ....
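One way to make the loop a bit more general (a sketch of mine, using np.unique so the y values need not be known beforehand; the small example A is made up):

```python
import numpy as np

A = np.array([[0.2, 0.], [0.4, 0.], [0.9, 3.]])

ys = np.unique(A[:, 1])              # grouping values actually present
B = np.zeros((len(ys), 2))
for i, y in enumerate(ys):
    mask = A[:, 1] == y              # rows belonging to this group
    B[i, 0] = A[mask, 0].mean()      # mean of the matching x values
    B[i, 1] = y
print(B)
# [[0.3 0. ]
#  [0.9 3. ]]
```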

XaC