Find consecutive sequences based on Boolean array

Question

I'm trying to extract sequences from an array b for which a boolean array a is used as index (len(a) >= len(b), but (a==True).sum() == len(b), i.e. there are only as many values true in a than there are elements in b). The sequences should be represented in the result as start and end index of a where a[i] is true and for which there are consecutive values.

For instance, for the following arrays of a and b

a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]

the result should be [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])], so as many elements in the array as there are true-sequences. Each true sequence should contain the start and end index from a and the values these relate to from b).

So for the above:

[
 ((0, 1), [1, 2]),   # first true sequence: starting at index=0 (in a), ending at index=1, mapping to the values [1, 2] in b

 ((5, 7), [3, 4, 5]) # second true sequence: starting at index=5, ending at index=7, with values in b=[3, 4, 5]
]

How can this be done efficiently in numpy?

Can you show an example where `len(a)>len(b)`. I don't follow how you will choose indices in this case unless all elements of a which are not indexable in b are False? i.e., `b=[0,1,2,3]` and `a=[True, True, True, False, False, False, False]`. — FChm, Mar 26 '19 at 07:24
Thanks, you're right. It doesn't make sense. `len(b)` is actually only as long as there are true elements in _a_. I'll correct my example and text. — orange, Mar 26 '19 at 07:26
OK i understand, strange question. I guess its relatively easy to do with for loops but you are looking for a more efficient implementation. — FChm, Mar 26 '19 at 07:39
Yes, creating an efficient solution is the problem. Sorry, the requirement changed again. I realised that I need the indices from _a_ as well as the values in _b_. — orange, Mar 26 '19 at 07:53
@Chris "Fast" is a requirement. It'll be hard to do this w/o numpy, IMHO... — orange, Mar 26 '19 at 08:07
@orange IMHO, _Fast_ is very ambiguous term. If you already have some code to make a baseline, please provide it. In the meantime, feel free to try `itertools` :) — Chris, Mar 26 '19 at 08:10

Divakar · Accepted Answer · 2019-03-26T09:56:31.493

Here's one NumPy based one inspired by this post -

def func1(a,b):
    # "Enclose" mask with sentients to catch shifts later on
    mask = np.r_[False,a,False]

    # Get the shifting indices
    idx = np.flatnonzero(mask[1:] != mask[:-1])

    s0,s1 = idx[::2], idx[1::2]
    idx_b = np.r_[0,(s1-s0).cumsum()]
    out = []
    for (i,j,k,l) in zip(s0,s1-1,idx_b[:-1],idx_b[1:]):
        out.append(((i, j), b[k:l]))
    return out

Sample run -

In [104]: a
Out[104]: array([ True,  True, False, False, False,  True,  True,  True, False])

In [105]: b
Out[105]: [1, 2, 3, 4, 5]

In [106]: func1(a,b)
Out[106]: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]

Timings -

In [156]: # Using given sample data and tiling it 1000x
     ...: a = np.asarray([True, True, False, False, False, True, True, True, False])
     ...: b = [1, 2, 3, 4, 5]
     ...: a = np.tile(a,1000)
     ...: b = np.tile(b,1000)

# @Chris's soln
In [157]: %%timeit
     ...: res = []
     ...: gen = (i for i in b)
     ...: for k, g in itertools.groupby(enumerate(a), lambda x:x[1]):
     ...:     if k:
     ...:         ind, bools = list(zip(*g))
     ...:         res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
100 loops, best of 3: 13.8 ms per loop

In [158]: %timeit func1(a,b)
1000 loops, best of 3: 1.29 ms per loop

Chris · Answer 2 · 2019-03-26T08:11:49.673

Using itertools.groupby and itertools.islice:

import itertools

res = []
gen = (i for i in b)
for k, g in itertools.groupby(enumerate(a), lambda x:x[1]):
    if k:
        ind, bools = list(zip(*g))
        res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))

Output

[((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]

Insights:

itertools.groupby returns grouped object of Trues and Falses.
list[0::len(list)-1] returns first and last element of list.
Since b always have a same number of Trues, make b a generator and grab as many elements as there are Trues.

Time taken:

def itertool_version():
    res = []
    gen = (i for i in b)
    for k, g in itertools.groupby(enumerate(a), lambda x:x[1]):
        if k:
            ind, bools = list(zip(*g))
            res.append((ind[0::len(ind)-1], list(itertools.islice(gen, len(bools)))))
    return res

%timeit itertool()
7.11 µs ± 313 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

score 1 · Answer 3 · answered Mar 26 '19 at 07:50

I don't know about a solution using numpy, but maybe the following for-loop solution will help you (or others) finding a different, more efficient solution:

import numpy as np

a = np.asarray([True, True, False, False, False, True, True, True, False])
b = []
temp_list = []
count = 0
for val in a:
    if (val):
        count += 1
        temp_list.append(count) if len(temp_list) == 0 else None  # Only add the first 'True' value in a sequence
    # Code only reached if val is not true > append b if temp_list has more than 1 entry
    elif (len(temp_list) > 0):
        temp_list.append(count)  # Add the last true value in a sequence
        b.append(temp_list)
        temp_list = []
print(b)

>>> [[1, 2], [3, 5]]

score 1 · Answer 4 · answered Mar 26 '19 at 08:02

Here is my two cents. Hope it helps. [EDITED]

# Get Data
a = np.asarray([True, True, False, False, False, True, True, True, False])
b = [1, 2, 3, 4, 5]

# Assign Index names
ac = ac.astype(float)
ac[ac==1] = b


# Select edges
ac[(np.roll(ac, 1) != 0) & (np.roll(ac, -1) != 0)] = 0 # Clear out intermediates
indices = ac[ac != 0] # Select only edges
indices.reshape(2, int(indices.shape[0]/2)) # group in pairs

Output

>> [[1, 2], [3, 5]]

Camile · Answer 5 · 2019-03-26T09:46:22.733

Solution uses a method where() from numpy:

result = []
f = np.where(a)[0]
m = 1
for j in list(create(f)):
    lo = j[1]-j[0]+1
    result.append((j, [*range(m, m + lo)]))
    m += lo

print(result)
#OUTPUT: [((0, 1), [1, 2]), ((5, 7), [3, 4, 5])]

And there is a method to split an array [0 1 5 6 7] -- > [(0, 1), (5, 7)]:

def create(k):
    le = len(k)
    i = 0

    while i < le:
        left = k[i]
        while i < le - 1 and k[i] + 1 == k[i + 1]:
            i += 1
        right = k[i]
        if right - left >= 1:
            yield (left, right)
        elif right - left == 1:
            yield (left, )
            yield (right, )
        else:
            yield (left, )
        i += 1

Find consecutive sequences based on Boolean array

5 Answers5