
I have a sorted array of ints which might contain repetitions. I would like to count consecutive equal values, restarting from zero whenever a value differs from the previous one. This is the expected result, implemented with a simple Python loop:

import numpy as np

def count_multiplicities(a):
    r = np.zeros(a.shape, dtype=a.dtype)
    for i in range(1, len(a)):
        if a[i] == a[i-1]:
            r[i] = r[i-1]+1
        else:
            r[i] = 0
    return r

a = (np.random.rand(20)*5).astype(dtype=int)
a.sort()

print("given sorted array: ", a)
print("multiplicity count: ", count_multiplicities(a))

Output:

given sorted array:  [0 0 0 0 0 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4]
multiplicity count:  [0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 0 1 2 3]

How can I get the same result in an efficient way using numpy? The array is very long, but the repetitions are just a few (say no more than ten).

In my special case I also know that values start from zero and that the difference between consecutive values is either 0 or 1 (no gaps in values).

Divakar
Emanuele Paolini
Further thought: maybe a solution can be obtained by multiplying with a well-chosen matrix with a given diagonal and upper diagonal. – Emanuele Paolini Jul 26 '17 at 09:14

2 Answers


Here's one cumsum-based vectorized approach -

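def count_multiplicities_cumsum_vectorized(a):
    # Every element adds +1 to the running count by default
    out = np.ones(a.size, dtype=int)
    if a.size == 0:
        return out
    out[0] = 0
    # Indices where a new run of equal values starts
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    if idx.size:  # skip when the array is a single run (idx[0] would fail)
        # At each run start, cancel the count accumulated by the previous run
        out[idx[0]] = -idx[0] + 1
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
    np.cumsum(out, out=out)
    return out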

Sample run -

In [58]: a
Out[58]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4])

In [59]: count_multiplicities(a) # Original approach
Out[59]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])

In [60]: count_multiplicities_cumsum_vectorized(a)
Out[60]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])
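
To make the cumsum trick easier to follow, here is a small sketch that spells out the intermediate "steps" array on a tiny input. The function name `count_by_run_lengths` and the helper variables are illustrative only, not part of the answer above; the idea is the same, just written in terms of run lengths.

```python
import numpy as np

def count_by_run_lengths(a):
    # Hypothetical helper, written for illustration only.
    steps = np.ones(a.size, dtype=int)
    if a.size == 0:
        return steps
    steps[0] = 0
    # Positions where the value differs from its predecessor (run starts).
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    # Length of every run except the last one.
    run_lengths = np.diff(np.concatenate(([0], idx)))
    # A run start must undo the previous run's count: 1 - previous run length.
    steps[idx] = 1 - run_lengths
    return np.cumsum(steps)

a = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
# steps before cumsum: [0, 1, 1, -2, 1, -1, 1, 1, 1]
print(count_by_run_lengths(a))  # [0 1 2 0 1 0 1 2 3]
```

Within a run the cumulative sum grows by one per element, and the negative entry at each run start brings it back to zero.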

Runtime test -

In [66]: a = (np.random.rand(200000)*1000).astype(dtype=int)
    ...: a.sort()
    ...: 

In [67]: a
Out[67]: array([  0,   0,   0, ..., 999, 999, 999])

In [68]: %timeit count_multiplicities(a)
10 loops, best of 3: 87.2 ms per loop

In [69]: %timeit count_multiplicities_cumsum_vectorized(a)
1000 loops, best of 3: 739 µs per loop
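
For anyone who wants to reproduce the comparison outside IPython, a rough sketch using the standard `timeit` module could look like the following. The function names `count_loop` and `count_cumsum` are shortened stand-ins chosen here, and the numbers will of course depend on the machine, so none are shown.

```python
import timeit
import numpy as np

def count_loop(a):
    # Reference: plain Python loop, as in the question.
    r = np.zeros(a.shape, dtype=a.dtype)
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            r[i] = r[i - 1] + 1
    return r

def count_cumsum(a):
    # cumsum-based version, guarded for empty input and single-run input.
    out = np.ones(a.size, dtype=int)
    if a.size == 0:
        return out
    out[0] = 0
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    if idx.size:
        out[idx[0]] = 1 - idx[0]
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
    return np.cumsum(out)

rng = np.random.default_rng(0)
a = np.sort(rng.integers(0, 1000, size=200_000))

# Check agreement before timing anything.
assert np.array_equal(count_loop(a), count_cumsum(a))

for f in (count_loop, count_cumsum):
    best = min(timeit.repeat(lambda: f(a), number=1, repeat=3))
    print(f"{f.__name__}: {best * 1e3:.2f} ms")
```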

Related post.

Divakar

I would use numba for problems like this:

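import numba
# Lazy compilation: numba infers the signature on the first call, so this
# works for both int32 and int64 input (an explicit "int32[:](int32[:])"
# signature rejects int64 arrays, the default int type on many platforms).
nb_count_multiplicities = numba.njit(count_multiplicities)
X = nb_count_multiplicities(a)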

Without rewriting your code at all it is about 50 percent faster than Divakar's vectorized solution.

Vectorizing is often useful when it results in shorter and perhaps more understandable code. But if you have to force a vectorized formulation that would be a challenge even for a fairly experienced programmer, numba is the way to go.

max9111