4

How can I find the amount of consecutive 1s (or any other value) in each row for of the following numpy array? I need a pure numpy solution.

array([[0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 1, 1],
       [0, 0, 0, 4, 1, 0, 0, 0, 0, 1, 1, 0]])

There are two parts to my question, first: what is the maximum number of 1s in a row? Should be

array([2,3,2])

in the example case.

And second, what is the index of the start of the first set of multiple consecutive 1s in a row? For the example case this would be

array([3,9,9])

In this example I put 2 consecutive 1s in a row. But it should be possible to change that to 5 consecutive 1s in a row, this is important.

A similar question was answered using np.unique, but it only works for one row and not an array with multiple rows as the result would have different lengths.

William Miller
  • 9,839
  • 3
  • 25
  • 46
Nickpick
  • 6,163
  • 16
  • 65
  • 116
  • How is it `2` for the first row? – Divakar Feb 24 '16 at 19:08
  • corrected the desired output – Nickpick Feb 24 '16 at 19:12
  • Would you be okay with using pandas module? – Divakar Feb 24 '16 at 19:47
  • Unfortunately it needs to be pure numpy as speed it very important. Pandas will be much slower. – Nickpick Feb 24 '16 at 19:48
  • If pandas isn't slow, would you use it? – Divakar Feb 24 '16 at 19:48
  • Yes I would definitely as it would' be much easier. But I know that the speed difference is almost 50x according to some research I've done. I'm ok if on my average 2015 laptop it can complete the operation within .5sec for 1m rows. Numpy will definitely be able to do that. Pandas will be significantly slower I believe: http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ – Nickpick Feb 24 '16 at 19:49
  • I think the second question isn't closely related to the first one neglecting the fact that they are shown for the same sample case. So, I think it would make sense to make a separate question out of it. – Divakar Feb 24 '16 at 20:24
  • Sure. I'm more than happy if the first question alone is answered. I can then move forward and see what's missing. – Nickpick Feb 24 '16 at 20:25

2 Answers2

6

Here's a vectorized approach based on differentiation -

import numpy as np
import pandas  as pd

# Append zeros columns at either sides of counts
append1 = np.zeros((counts.shape[0],1),dtype=int)
counts_ext = np.column_stack((append1,counts,append1))

# Get start and stop indices with 1s as triggers
diffs = np.diff((counts_ext==1).astype(int),axis=1)
starts = np.argwhere(diffs == 1)
stops = np.argwhere(diffs == -1)

# Get intervals using differences between start and stop indices
start_stop = np.column_stack((starts[:,0], stops[:,1] - starts[:,1]))

# Get indices corresponding to max. interval lens and thus lens themselves
SS_df = pd.DataFrame(start_stop)
out = start_stop[SS_df.groupby([0],sort=False)[1].idxmax(),1]

Sample input, output -

Original sample case :

In [574]: counts
Out[574]: 
array([[0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 2, 0, 0, 1, 1, 1],
       [0, 0, 0, 4, 1, 0, 0, 0, 0, 1, 1, 0]])

In [575]: out
Out[575]: array([2, 3, 2], dtype=int64)

Modified case :

In [577]: counts
Out[577]: 
array([[0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0],
   [0, 0, 1, 0, 0, 1, 2, 0, 1, 1, 1, 1],
   [0, 0, 0, 4, 1, 1, 1, 1, 1, 0, 1, 0]])

In [578]: out
Out[578]: array([2, 4, 5], dtype=int64)

Here's a Pure NumPy version that is identical to the previous until we have start, stop. Here's the full implementation -

# Append zeros columns at either sides of counts
append1 = np.zeros((counts.shape[0],1),dtype=int)
counts_ext = np.column_stack((append1,counts,append1))

# Get start and stop indices with 1s as triggers
diffs = np.diff((counts_ext==1).astype(int),axis=1)
starts = np.argwhere(diffs == 1)
stops = np.argwhere(diffs == -1)

# Get intervals using differences between start and stop indices
intvs = stops[:,1] - starts[:,1]

# Store intervals as a 2D array for further vectorized ops to make.
c = np.bincount(starts[:,0])
mask = np.arange(c.max()) < c[:,None]
intvs2D = mask.astype(float)
intvs2D[mask] = intvs

# Get max along each row as final output
out = intvs2D.max(1)
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • it may work, but I just timed it and it's around 75x too slow. Seems the speed difference between Pandas and Numpy is huge. The difficulty is readlly to vectorise it in a pure numpy solution and in my case this is necessary. I'm sure it's possible! I measure the speed difference with similar operations I conduct on numpy arrays for 1m rows and they take less than 100ms for 1m rows. – Nickpick Feb 24 '16 at 20:57
  • @nickpick See if the edited codes make any better impact? – Divakar Feb 24 '16 at 22:17
  • Yes! Now it's the speed of numpy. What did you change? – Nickpick Feb 24 '16 at 22:20
  • @nickpick Well as commented I have put the intervals into a regular 2D array for vectorized ops. I have used this technique a lot in MATLAB and I call it as `bsxfun's masking capability`, where `bsxfun` is MATLAB's broadcasting feature. Please note that this incurs more memory usage that is dependent on the data variation. Here are few examples of it, if you are familiar with MATLAB : http://stackoverflow.com/search?tab=votes&q=user%3a3293881%20masking%20capability – Divakar Feb 24 '16 at 22:23
  • Another question, very similar (I think), if you still have some energy please have a look: http://stackoverflow.com/questions/35615111/find-5-consecutive-numbers-in-numpy-array-by-row-ignore-duplicates – Nickpick Feb 24 '16 at 23:10
  • Way late here, but just a note that this will ignore rows with zero occurrences of 1s. – Brad Solomon Jan 25 '18 at 04:01
0

I think one problem that is very similar is to check if between the sorted rows the element wise difference is a certain amount. Here if there is a difference of 1 between 5 consecutive would be as follows. It can also be done for difference of 0 for two cards:

cardAmount=cards[0,:].size
has4=cards[:,np.arange(0,cardAmount-4)]-cards[:,np.arange(cardAmount-3,cardAmount)]
isStraight=np.any(has4 == 4, axis=1)
Nickpick
  • 6,163
  • 16
  • 65
  • 116