2

I have four 1D NumPy arrays of equal length. The first three together act as an ID that identifies each entry of the fourth array.

The ID arrays contain repeated combinations. For each repeated combination I need to sum the corresponding entries of the fourth array and keep only one copy of that combination in all four arrays.

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

In this case I need:

x = [1, 2, 4]
y = [1, 1, 4]
z = [1, 2, 2]
data = [6, 7, 3]

The arrays are rather long so loops really won't work. I'm sure there is a fairly simple way to do this, but for the life of me I can't figure it out.

fox21

2 Answers

4

To get started, we can stack the ID vectors into a matrix such that each ID is a row of three values:

XYZ = np.vstack((x,y,z)).T

Now we just need to find the indices of repeated rows. At the time of writing, np.unique didn't operate on rows (NumPy 1.13 later added an axis argument), so we need a sort-and-compare trick:

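# lexsort orders the rows so that identical rows end up adjacent
order = np.lexsort(XYZ.T)
# row-to-row differences in sorted order; an all-zero row marks a repeat
diff = np.diff(XYZ[order], axis=0)
# True at the first occurrence of each distinct row (the first row always counts)
uniq_mask = np.append(True, (diff != 0).any(axis=1))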
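
To make the sort-and-compare step concrete, here is a small trace on the question's example arrays. It is only a sketch that reuses the variable names from this answer and prints the intermediate values.

```python
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
XYZ = np.vstack((x, y, z)).T

# lexsort treats the LAST key (z) as primary, then y, then x
order = np.lexsort(XYZ.T)
diff = np.diff(XYZ[order], axis=0)
uniq_mask = np.append(True, (diff != 0).any(axis=1))

print(order.tolist())      # [0, 3, 1, 2]
print(uniq_mask.tolist())  # [True, False, True, True]
```

Rows 0 and 3 are identical, so they sit next to each other after sorting and the second of the pair is masked out.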

This part is borrowed from the np.unique source code, and finds the unique indices as well as the "inverse index" mapping:

uniq_inds = order[uniq_mask]
inv_idx = np.zeros_like(order)
inv_idx[order] = np.cumsum(uniq_mask) - 1
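
As a quick sanity check on the inverse mapping, the sketch below (again on the question's example data) confirms that indexing the unique rows with inv_idx rebuilds the original rows.

```python
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
XYZ = np.vstack((x, y, z)).T

order = np.lexsort(XYZ.T)
diff = np.diff(XYZ[order], axis=0)
uniq_mask = np.append(True, (diff != 0).any(axis=1))

uniq_inds = order[uniq_mask]
inv_idx = np.zeros_like(order)
inv_idx[order] = np.cumsum(uniq_mask) - 1

print(uniq_inds.tolist())  # [0, 1, 2]
print(inv_idx.tolist())    # [0, 1, 2, 0]

# Each original row maps back to its unique representative
roundtrip_ok = bool((XYZ[uniq_inds][inv_idx] == XYZ).all())
print(roundtrip_ok)        # True
```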

Finally, sum over the unique indices:

data = np.bincount(inv_idx, weights=data)
x,y,z = XYZ[uniq_inds].T
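
On NumPy 1.13 or later, np.unique accepts axis=0 and can stand in for the lexsort steps. The following is a minimal sketch under that version assumption, not the method used above; the helper name sum_by_id is hypothetical. Note that the unique rows come back sorted rather than in order of first appearance, and that np.bincount with weights returns floats.

```python
import numpy as np

def sum_by_id(x, y, z, data):
    """Sum `data` over repeated (x, y, z) combinations (hypothetical helper)."""
    XYZ = np.column_stack((x, y, z))
    # Requires NumPy >= 1.13 for the axis argument
    uniq_rows, inv_idx = np.unique(XYZ, axis=0, return_inverse=True)
    # ravel() guards against NumPy versions that return a 2D inverse
    sums = np.bincount(inv_idx.ravel(), weights=data)
    ux, uy, uz = uniq_rows.T
    return ux, uy, uz, sums

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

ux, uy, uz, sums = sum_by_id(x, y, z, data)
print(ux.tolist(), uy.tolist(), uz.tolist(), sums.tolist())
# [1, 2, 4] [1, 1, 4] [1, 2, 2] [6.0, 7.0, 3.0]
```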
perimosocordiae
  • `data = np.bincount(inv_idx, weights=data)` is the fast way of adding the repeats in numpy. – Jaime Sep 11 '14 at 14:26
  • Thanks so much, this works perfectly. On the data I have (arrays ~120k long) it takes roughly 1.4 seconds. @wasserfeder's solution seems to work but took ~28 seconds. – fox21 Sep 11 '14 at 16:05
  • Also, there may be slightly faster options to find unique rows in a 2D array; there was plenty of discussion [here](http://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array/16973510#16973510). – Jaime Sep 11 '14 at 17:09
2

You can use np.unique and a sum, as reptilicus suggested:

import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

# Alternative flat ids for non-negative integers (N must exceed every value):
# N = max(x.max(), y.max(), z.max()) + 1
# ids = x + y*N + z*(N**2)
# Python 3: the built-in zip replaces itertools.izip
ids = np.array([hash((a, b, c)) for a, b, c in zip(x, y, z)]) # creates flat ids

_, idx, idx_rep = np.unique(ids, return_index=True, return_inverse=True)

x_out = x[idx]
y_out = y[idx]
z_out = z[idx]
# data_out = np.array([np.sum(data[idx_rep == i]) for i in range(len(idx))])
data_out = np.bincount(idx_rep, weights=data)

print(x_out)
print(y_out)
print(z_out)
print(data_out)
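
The list comprehension over hash runs a Python-level loop, which may account for part of the slower timing on long arrays. If the ID values are non-negative integers, the flat ids can be built without a loop. This is a sketch under that assumption, using np.ravel_multi_index in place of hash; it also assumes the product of the three dimension sizes fits in a 64-bit integer.

```python
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

# Assumes non-negative integer ids; each dim must exceed the largest value
dims = (int(x.max()) + 1, int(y.max()) + 1, int(z.max()) + 1)
ids = np.ravel_multi_index((x, y, z), dims)  # one flat id per (x, y, z) triple

_, idx, idx_rep = np.unique(ids, return_index=True, return_inverse=True)

x_out = x[idx]
y_out = y[idx]
z_out = z[idx]
data_out = np.bincount(idx_rep, weights=data)

print(x_out.tolist(), y_out.tolist(), z_out.tolist(), data_out.tolist())
# [1, 2, 4] [1, 1, 4] [1, 2, 2] [6.0, 7.0, 3.0]
```

Unlike hash, this encoding cannot collide as long as every value stays within dims.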
wasserfeder