One approach with np.unique
to generate those unique tags and the interval shifting indices and then np.add.reduceat
for the intervaled-summing
-
_,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Another way that avoids the use of np.unique
and might be beneficial on performance would be like so -
idx = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1]
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
For further performance boost, we should use np.bincount
. Thus, np.add.reduceat(a[:,1],idx)
could be replaced by np.bincount(tags, a[:,1])
.
Sample run -
In [271]: a # Using a more generic sample
Out[271]:
array([[11, 4],
[11, 4],
[14, 8],
[14, 8],
[16, 10],
[16, 10]])
In [272]: _,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
In [273]: np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Out[273]:
array([[11, 4, 8],
[11, 4, 8],
[14, 8, 16],
[14, 8, 16],
[16, 10, 20],
[16, 10, 20]])]
Now, the listed approaches assume that the first column is already sorted. If that's not the case, we need to sort the array by the first column argsort
and then use the proposed method. Thus, for the not sorted case, we need the following as pre-processing -
a = a[a[:,0].argsort()]
Battle against np.unique
Let's time the custom flatnonzero
+ cumsum
based method against the built-in np.unique
to create the shifting indices : idx
and the uniqueness based IDs/tags : tags
. For a case like this one, where we know beforehand that the labels column is already sorted, we are avoiding any sorting, as done with np.unique
. This gives us an advantage on performance. So, let's verify it.
Approaches -
def nonzero_cumsum_based(A):
idx = np.concatenate(( [0] ,np.flatnonzero(A[1:] > A[:-1])+1 ))
tags = np.zeros(len(A), dtype=int)
tags[idx[1:]] = 1
np.cumsum(tags, out = tags)
return idx, tags
def unique_based(A):
_,idx,tags = np.unique(A, return_index=1, return_inverse=1)
return idx, tags
Sample run with the custom func -
In [438]: a
Out[438]:
array([[11, 4],
[11, 4],
[14, 8],
[14, 8],
[16, 10],
[16, 10]])
In [439]: idx, tags = nonzero_cumsum_based(a[:,0])
In [440]: idx
Out[440]: array([0, 2, 4])
In [441]: tags
Out[441]: array([0, 0, 1, 1, 2, 2])
Timings -
In [444]: a = np.c_[np.sort(randi(10,10000,(100000))), randi(0,10000,(100000))]
In [445]: %timeit unique_based(a[:,0])
100 loops, best of 3: 4.3 ms per loop
In [446]: %timeit nonzero_cumsum_based(a[:,0])
1000 loops, best of 3: 486 µs per loop
In [447]: a = np.c_[np.sort(randi(10,10000,(1000000))), randi(0,10000,(1000000))]
In [448]: %timeit unique_based(a[:,0])
10 loops, best of 3: 50.2 ms per loop
In [449]: %timeit nonzero_cumsum_based(a[:,0])
100 loops, best of 3: 3.98 ms per loop