[1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

I have a NumPy array of 0's and 1's like the one above. How can I compute a running sum of consecutive 1's, as shown below? Any time I encounter a 0, the count resets.

[1, 2, 3, 0, 0, 0, 1, 2, 0, 0]

I can do this using a for loop, but is there a vectorized solution using NumPy?
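
For reference, the loop approach mentioned above could look something like this (a minimal sketch added for context, not code from the original post) -

import numpy as np

def island_cumsum_loop(a):
    # Plain-Python baseline: running count of consecutive 1s, resetting at every 0
    out = np.zeros_like(a)
    count = 0
    for i, val in enumerate(a):
        count = count + 1 if val else 0
        out[i] = count
    return out

print(island_cumsum_loop(np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])))
# -> [1 2 3 0 0 0 1 2 0 0]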


2 Answers


Here's a vectorized approach -

import numpy as np

def island_cumsum_vectorized(a):
    # Pad with zeros so every island of 1s has a start and an end boundary
    a_ext = np.concatenate(([0], a, [0]))
    # Indices where the value changes: even entries mark island starts, odd entries mark island ends
    idx = np.flatnonzero(a_ext[1:] != a_ext[:-1])
    # At each island end, write a negative offset that cancels that island's running sum
    a_ext[1:][idx[1::2]] = idx[::2] - idx[1::2]
    return a_ext.cumsum()[1:-1]  # cumsum now restarts at 0 after every island; drop the padding

Sample run -

In [91]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [92]: island_cumsum_vectorized(a)
Out[92]: array([1, 2, 3, 0, 0, 0, 1, 2, 0, 0])

In [93]: a = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1])

In [94]: island_cumsum_vectorized(a)
Out[94]: array([0, 1, 2, 3, 4, 0, 0, 0, 1, 2, 0, 0, 1])
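
To make the mechanism easier to follow, here is an illustrative trace of the intermediate arrays for the first sample (added for clarity; not part of the original answer) -

import numpy as np

a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
a_ext = np.concatenate(([0], a, [0]))
idx = np.flatnonzero(a_ext[1:] != a_ext[:-1])
print(idx)                   # [0 3 6 8] -> alternating island starts and ends
a_ext[1:][idx[1::2]] = idx[::2] - idx[1::2]
print(a_ext)                 # [ 0  1  1  1 -3  0  0  1  1 -2  0  0]
print(a_ext.cumsum()[1:-1])  # [1 2 3 0 0 0 1 2 0 0]

Each island of 1s is followed by a negative value equal to its length, so the cumulative sum falls back to zero right after the island ends.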

Runtime test

For the timings, I would use the OP's sample input array and repeat/tile it; hopefully this makes for a less opportunistic benchmark -

Small case:

In [16]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [17]: a = np.tile(a,10)  # Repeat OP's data 10 times

# @Paul Panzer's solution
In [18]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
10000 loops, best of 3: 73.4 µs per loop

In [19]: %timeit island_cumsum_vectorized(a)
100000 loops, best of 3: 8.65 µs per loop

Bigger case:

In [20]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [21]: a = np.tile(a,1000)  # Repeat OP's data 1000 times

# @Paul Panzer's solution
In [22]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
100 loops, best of 3: 6.52 ms per loop

In [23]: %timeit island_cumsum_vectorized(a)
10000 loops, best of 3: 49.7 µs per loop

Nah, I want a really huge case:

In [24]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [25]: a = np.tile(a,100000)  # Repeat OP's data 100000 times

# @Paul Panzer's solution
In [26]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
1 loops, best of 3: 725 ms per loop

In [27]: %timeit island_cumsum_vectorized(a)
100 loops, best of 3: 7.28 ms per loop
– Divakar
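
If you want to reproduce the comparison outside IPython, a self-contained harness along these lines should work (timings will of course vary by machine; the wrapper function names here are just labels, not from the original answers) -

import numpy as np
from timeit import timeit

def island_cumsum_vectorized(a):
    a_ext = np.concatenate(([0], a, [0]))
    idx = np.flatnonzero(a_ext[1:] != a_ext[:-1])
    a_ext[1:][idx[1::2]] = idx[::2] - idx[1::2]
    return a_ext.cumsum()[1:-1]

def island_cumsum_split(a):
    # Paul Panzer's list-comprehension approach, wrapped in a function for timing
    return np.concatenate([np.cumsum(c) if c[0] == 1 else c
                           for c in np.split(a, 1 + np.where(np.diff(a))[0])])

a = np.tile(np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0]), 1000)
for fn in (island_cumsum_vectorized, island_cumsum_split):
    per_call = timeit(lambda: fn(a), number=100) / 100
    print(f"{fn.__name__}: {per_call * 1e6:.1f} µs per call")
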
  • benchmark these: `a1 = np.repeat(np.arange(1000)%2, np.random.randint(1, 1000, (1000,)))`; `a2 = np.repeat(np.arange(100)%2, np.random.randint(1, 10000, (100,)))`; `a3 = np.repeat(np.random.random((100,)) < 0.2, np.random.randint(1, 10000, (100,)))` ;-) – Paul Panzer Feb 09 '17 at 06:55
  • @PaulPanzer I decided to run all of these benchmarks myself on your suggested inputs. Divakar's implementation is still faster :P... by about 2x. – rayryeng Feb 09 '17 at 07:04
  • @rayreng Also on the last one? But I admit his is better; I'm just objecting to opportunistic benchmarks ;-) – Paul Panzer Feb 09 '17 at 07:07
  • @PaulPanzer :D. I've already upvoted yours as it's very nice and in one line. The last one I just ran... it's just beaten by a slight margin. – rayryeng Feb 09 '17 at 07:08
  • @rayryeng Strange, on my laptop (100 repeats) `Divakar:0.47050797403790057;0.26334572909399867;0.3771821779664606` `Paul Panzer:0.6545195570215583;0.4508163968566805;0.17192376195453107` – Paul Panzer Feb 09 '17 at 07:17
  • @PaulPanzer Added a few less opportunistic benchmarks. – Divakar Feb 09 '17 at 08:15
  • I could start a big argument on how to scale a toy example, but who cares. You can have this one. Besides, I've already conceded yours is better. :-) – Paul Panzer Feb 09 '17 at 08:23
  • @PaulPanzer You are looping, mine has some setup overhead. The only way a loopy one would be better is if we are iterating less, i.e. either we are working with a very small array or we have fewer islands of 1s. I tried your `a3` and that's `~2x` slower with the loopy one. – Divakar Feb 09 '17 at 08:54
  • @Divakar That's interesting: on my laptop the 'loopy one' is twice as fast, on your rig it's the other way round, and for rayreng they are apparently almost tied (I mean with a3). Any idea? Are you testing directly from the shell? I made a small file. Could the byte compilation make a difference? – Paul Panzer Feb 09 '17 at 09:16
  • As I said earlier, if you feed it very few islands of 1s, a loopy one would be comparable or even faster. Use your `a3`, count the number of islands with `len([c for c in np.split(a, 1 + np.where(np.diff(a))[0])])//2`, and compare this number with the length of the array, `a.size`. You would see it's a very small number of islands. For the sample, the OP had `2` islands for an array of length `10`. I don't want to make this a research project, as my point about when to use a vectorized solution stays the same :) – Divakar Feb 09 '17 at 09:43
  • @Divakar yes, I understand that which is why I designed the example in the first place :-)) What puzzles me are the differences between your and rayreng's and my benchmarking on the _same_ type of sample. The ratio apparently varies by a factor of 4... – Paul Panzer Feb 09 '17 at 11:36
  • @PaulPanzer Use `np.random.random((10,)) < 0.2` and you would have more erratic results across different runs. Use `np.random.random((100,)) < 0.2` for less erratic runs, `np.random.random((1000,)) < 0.2` for even less erratic ones, and so on, if you see the pattern there. – Divakar Feb 09 '17 at 11:42
  • @Divakar Yes, but I can run it 10 times and I do see some variation, but nowhere near enough to explain the differences. – Paul Panzer Feb 09 '17 at 11:51
  • @PaulPanzer Maybe you need to run it more times. Also, you need to wake up rayreng and ask him to do so too, and finally ask me to do the same. Maybe we can sit together and discuss this, along with politics, sometime :) – Divakar Feb 09 '17 at 11:57
  • @Divakar I give up :-) – Paul Panzer Feb 09 '17 at 12:03
  • @PaulPanzer Offer a bounty and we might have an answer on that! One day a bounty might solve politics issues too, hopefully :) – Divakar Feb 09 '17 at 12:05
  • @Divakar Cursed be the devil who invented this rep system! So addictive. I'm spending way too much time here. – Paul Panzer Feb 09 '17 at 12:14
  • @PaulPanzer haha yeahh! Tell me about it! – Divakar Feb 09 '17 at 12:18
  • Would you have any idea how to do this for a 2d-array, counting on axis 1? – ConanG Dec 02 '17 at 10:43
  • @ConanG Would need considerable modifications to accommodate for 2D arrays I think. Post a new question? – Divakar Dec 02 '17 at 10:50
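
Regarding the 2D question in the last two comments, one possible row-wise variant (a sketch added here, not from the original thread) keeps a cumulative sum per row and subtracts, after every 0, the total accumulated so far -

import numpy as np

def island_cumsum_2d(a):
    # Count consecutive 1s along axis 1 of a 2D 0/1 array, resetting at each 0
    cs = np.cumsum(a, axis=1)
    # Record the running total at every 0 ...
    reset = np.where(a == 0, cs, 0)
    # ... and carry it forward so it cancels everything accumulated before that 0
    np.maximum.accumulate(reset, axis=1, out=reset)
    return cs - reset

a2d = np.array([[1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
                [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]])
print(island_cumsum_2d(a2d))
# [[1 2 3 0 0 0 1 2 0 0]
#  [0 1 2 3 4 0 0 0 1 2]]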

If a list comprehension is acceptable:

import numpy as np  # `a` is the 0/1 input array from the question

np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
– Paul Panzer
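
For illustration (added here, not part of Paul Panzer's answer), this is how the pieces of the one-liner behave on the sample array -

import numpy as np

a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
# np.diff(a) is nonzero wherever the value changes, so these positions split
# the array into runs of equal values ("islands")
chunks = np.split(a, 1 + np.where(np.diff(a))[0])
print(chunks)
# [array([1, 1, 1]), array([0, 0, 0]), array([1, 1]), array([0, 0])]
result = np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in chunks])
print(result)
# [1 2 3 0 0 0 1 2 0 0]

Islands that start with 1 are replaced by their cumulative sum; islands of 0s are passed through unchanged.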