Filtering on very large array of 3 possible values

Question

I have a 2D array of 0s, 1s and 2s with a very large number of columns. I am trying to select only those rows in which the runs of consecutive zeros do not exceed a certain length. My method is to convert the array to characters, join the columns of each row into a string, and then apply a regular-expression filter to it. But this is very slow, especially the conversion and the joining of characters in each row. Is there a way to make it faster by an order of magnitude? Maybe using another tactic altogether?

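import re
import numpy as np

n = 100
k = 1000
x = np.random.choice([0, 1, 2], replace=True, size=(n, k))
# Join each row's digits into a single string such as '0120...'
s = np.apply_along_axis(lambda t: ''.join(t), 1, x.astype(str))

N_ramp = 3
# True for rows with no run of 1 to N_ramp zeros enclosed by 1s or 2s
mask = [re.search(r'[12]0{1,' + str(N_ramp) + r'}[12]', i) is None for i in s]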

score 1 · Accepted Answer · answered Jul 23 '23 at 04:19

Using the run-length approach from the Stack Overflow answer linked in the code below, you can get the lengths of the runs of consecutive True values. To apply this to your problem, turn your array into a boolean array that is True where the value is 0 and False otherwise. Then apply that algorithm to each row and check whether any run length in the result meets your condition (here, a run of more than N_consecutive zeros). I store the per-row results in a list; printing the sum shows how many rows contain such a run. To keep only the rows whose runs do not exceed the limit, negate these results.
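As a small illustration of the run-length idea, the sketch below breaks the one-liner into named steps on a toy row. The array `row` and the intermediate names (`boundaries`, `edges`, `run_lengths`, `true_runs`) are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical toy row: True marks a zero in the original data.
row = np.array([True, True, False, True, True, True, False])

# Mark run boundaries: position 0 if the row opens with True,
# every position where the value changes, and a sentinel at the end.
boundaries = np.concatenate(([row[0]], row[:-1] != row[1:], [True]))

# Indices of the boundaries; consecutive differences are run lengths.
edges = np.where(boundaries)[0]
run_lengths = np.diff(edges)

# The first measured run is always a True run and runs alternate,
# so every second entry is the length of a True run.
true_runs = run_lengths[::2]
print(true_runs)  # [2 3]
```

The two True runs in `row` have lengths 2 and 3, which is what the slice `[::2]` picks out.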

import numpy as np

n = 100
k = 1000
x = np.random.choice([0, 1, 2], replace=True, size=(n, k))

def get_consecutive_counts(arr):
    # Lengths of the runs of consecutive True values in a 1D boolean array
    # https://stackoverflow.com/a/24343375/12131013
    return np.diff(np.where(np.concatenate(([arr[0]],
                                            arr[:-1] != arr[1:],
                                            [True])))[0])[::2]

def has_N_consecutive(arr, N):
    # True if any run of True values is longer than N
    return np.any(get_consecutive_counts(arr) > N)

N_consecutive = 7
res = [has_N_consecutive(row, N_consecutive) for row in x == 0]
print(sum(res))
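Since the question asks for a large speedup, here is a minimal sketch of a variant that avoids the Python-level loop over rows. It is not the method from the linked answer but a swapped-in technique: a row contains a run of more than N zeros exactly when some window of N + 1 consecutive entries is all zeros, and window sums can be read off row-wise prefix sums. The function name `rows_with_long_zero_run` is hypothetical, and N is assumed to be a non-negative integer.

```python
import numpy as np

def rows_with_long_zero_run(x, N):
    """Boolean array: True for rows of x containing more than N consecutive zeros.

    Hypothetical helper for illustration; assumes x is 2D and N >= 0.
    """
    is_zero = (x == 0)
    n, k = is_zero.shape
    w = N + 1  # a run longer than N means some window of w entries is all zeros
    if w > k:
        return np.zeros(n, dtype=bool)
    # Row-wise prefix sums with a leading zero column, so that
    # c[:, j + w] - c[:, j] is the number of zeros in columns j .. j + w - 1.
    c = np.zeros((n, k + 1), dtype=np.int64)
    c[:, 1:] = np.cumsum(is_zero, axis=1)
    window_sums = c[:, w:] - c[:, :-w]
    return (window_sums == w).any(axis=1)

n = 100
k = 1000
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(n, k))  # values 0, 1, 2

N_consecutive = 7
res = rows_with_long_zero_run(x, N_consecutive)
print(res.sum())   # number of rows with a run of more than N_consecutive zeros
keep = x[~res]     # rows whose zero runs do not exceed N_consecutive
```

This trades some extra memory (an integer array the size of the input) for a single pass of array operations; whether it reaches an order of magnitude on your data would need to be timed.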
