Finding data gaps with bit masking

Question

I'm faced with a problem of finding discontinuities (gaps) of a given length in a sequence of numbers. So, for example, given [1,2,3,7,8,9,10] and a gap of length=3, I'll find [4,5,6]. If the gap is length=4, I'll find nothing. The real sequence is, of course, much longer. I've seen this problem in quite a few posts, and it had various applications and possible implementations.

One way I thought might work and should be relatively quick is to represent the complete set as a bit array containing 1 for each available number and 0 for each missing one, so the above will look like [1,1,1,0,0,0,1,1,1,1]. Then run a window function that XORs a mask of the given length against the complete set until every position in the window results in 1. This requires a single pass over the whole sequence, roughly O(n), plus the cost of masking at each step.
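A minimal sketch of that representation, written for Python 3. The helper name `to_presence` is hypothetical, and the code assumes the sequence holds integers and that the array should span from the smallest to the largest value:

```python
def to_presence(numbers):
    """Return a 0/1 list covering min(numbers)..max(numbers).

    Hypothetical helper: 1 marks a number that is present, 0 a missing one.
    """
    lo, hi = min(numbers), max(numbers)
    present = set(numbers)  # set lookup keeps the whole build at O(n)
    return [1 if n in present else 0 for n in range(lo, hi + 1)]


print(to_presence([1, 2, 3, 7, 8, 9, 10]))
```

Index `k` of the result stands for the number `min(numbers) + k`, so any position found in the array has to be shifted back by that offset.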

Here's what I managed to come up with:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER, holding 1 where the
            value is present and 0 where it is missing
    start:  indicates what index to start looking from
    length: the length the gap should be
    """

    # create the bitmask to check against, e.g. length=3 -> '111'
    # (joining a list of ints would raise a TypeError)
    mask = '1' * length

    # convert the input 0/1 mapping to bit string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    # the last window that fits starts at len(bits) - length
    for i in xrange(start, len(bits) - length + 1):

        # find where the next gap begins
        if bits[i] != '0': continue

        # candidate gap start: extract segment of size 'length'
        segment = bits[i:i+length]

        # use XOR between binary masks
        result  = bin( int(mask, 2) ^ int(segment, 2) )

        # XOR leaves the mask unchanged only if the segment is all zeros
        if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

This is fairly quick for ~100k (< 1 sec). I'd appreciate tips on how to make this faster / more efficient for larger sets. thanks!

I mustn't have understood the problem right. Can't you just look for adjacent elements for which `a[i + 1] - a[i] == gap + 1`? — Marcelo Cantos, Dec 07 '10 at 10:03
@Marcelo I think you really can, and that the OP is seriously overcomplicating things, perhaps based on some poorly-understood ideas about optimization. I wrote my answer on this assumption. — Karl Knechtel, Dec 07 '10 at 10:05
It would be important to know whether you want to look for gaps of several different lengths in the same sequence. If you'd want to look iteratively for gaps of length 1 to 100, then it might be worth the effort of transforming the sequence first. — Johannes Charra, Dec 07 '10 at 10:11
Do you simply want a list that complements the provided list having gaps matching the exact requested size? [1,2,5,8] length=2 -> [3,4,6,7]? — kevpie, Dec 07 '10 at 10:15
I want the first number that begins the gap. So using my examples above, calling `find_gap([1,1,1,0,0,0,1,1,1,1], 0, 3)` will return `4`. Also, I specified `start` and not idx to allow searching from an arbitrary number. Sorry if I was unclear. — sa125, Dec 07 '10 at 10:20

score 2 · Answer 1 · answered Dec 07 '10 at 10:01

Find the differences between adjacent numbers, and then look for a difference that's large enough. We find the differences by constructing two lists - all the numbers but the first, and all the numbers but the last - and subtracting them pairwise. We can use zip to pair the values up.

def find_gaps(numbers, gap_size):
    adjacent_differences = [(y - x) for (x, y) in zip(numbers[:-1], numbers[1:])]
    # If adjacent_differences[i] > gap_size, there is a gap of at least that
    # size between numbers[i] and numbers[i+1]. We return all such indexes in
    # a list - so if the result is [] (empty list), there are no gaps.
    return [i for (i, x) in enumerate(adjacent_differences) if x > gap_size]

(Also, please learn some Python idioms. We prefer direct iteration, and we have a real boolean type.)
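As the comments below point out, the question asks for gaps of exactly the requested size, which corresponds to an adjacent difference of `gap_size + 1`. A sketch of that variant for Python 3 follows; the name `find_exact_gaps` is made up for illustration, and the input is assumed to be sorted with no duplicates:

```python
def find_exact_gaps(numbers, gap_size):
    """Return every index i where exactly gap_size values are missing
    between numbers[i] and numbers[i + 1].

    Hypothetical variant of find_gaps; assumes numbers is sorted.
    """
    pairs = zip(numbers, numbers[1:])  # (numbers[i], numbers[i + 1])
    return [i for i, (x, y) in enumerate(pairs) if y - x == gap_size + 1]


print(find_exact_gaps([1, 2, 3, 7, 8, 9, 10], 3))
```

The first missing number of each gap is then `numbers[i] + 1`.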

Karl Knechtel


He doesn't want differences >= the gap size, only == the gap size. – Johannes Charra Dec 07 '10 at 10:14
So, `... if x == gap_size + 1` (per the description, there is a size-3 gap between 3 and 7, so the gap size is one less than the difference). :) – Karl Knechtel Dec 07 '10 at 10:16
Ok, +1 (dto. for your answer ;)) – Johannes Charra Dec 07 '10 at 10:21
very cool, though I need to tweak it a bit to make it work for me. Not sure what you mean by *direct iteration*. I'm aware of the boolean type, just used 0/1 since it worked better for my version :) - thanks! – sa125 Dec 07 '10 at 10:35

score 2 · Answer 2 · answered Dec 08 '10 at 10:21

You could use XOR and shift and it does run in roughly O(n) time.
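One way the XOR-and-shift idea might look in Python 3, using a single integer as the bit array. This is only a sketch: `find_gap_xor` is a hypothetical name, bit `k` is assumed to stand for the number `min(numbers) + k`, and, like the code in the question, it reports the first run of at least `length` missing numbers rather than a run of exactly that length.

```python
def find_gap_xor(numbers, length):
    """Return the first missing number that starts a run of at least
    `length` missing numbers, or -1 if there is none."""
    lo, hi = min(numbers), max(numbers)

    # Build the bit array: bit k is set when lo + k is present.
    bits = 0
    for n in numbers:
        bits |= 1 << (n - lo)

    mask = (1 << length) - 1  # `length` one-bits, e.g. 0b111 for 3
    width = hi - lo + 1       # number of meaningful bits

    for shift in range(width - length + 1):
        window = (bits >> shift) & mask
        # XOR leaves the mask unchanged only when the window is all zeros.
        if window ^ mask == mask:
            return lo + shift
    return -1


print(find_gap_xor([1, 2, 3, 7, 8, 9, 10], 3))
```

Note that shifting a very large Python integer is not a constant-time operation, so the loop count is linear but the total running time may not be.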

However, in practice, building an index (hash list of all gaps greater than some minimum length) might be a better approach.

Assuming that you start with a sequence of these integers (rather than a bitmask), you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiating it as an empty list if necessary) and then append the offset in the sequence.

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

Here's a very crude function to build the gap index, gap_idx():

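def gap_idx(s, thresh=2):
    """Map each adjacent difference >= thresh to the offsets where it occurs."""
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        # z is the offset of lw within s; i is the value that follows it
        if i - lw < thresh:
            lw = i
            continue
        # the key is the difference between neighbours (gap size + 1)
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret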

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.
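A rough Python 3 sketch of such a class is shown below. Everything in it (the name `GapTracker`, its methods, and the choice to key the index by the number of missing values) is an assumption made for illustration. It uses `bisect.bisect_left` plus `list.insert` instead of `insort()` because the insertion position is needed to find the neighbours; the list insert itself is still linear in the worst case.

```python
import bisect


class GapTracker:
    """Keep a sorted list of values plus an index: gap size -> gap starts."""

    def __init__(self, values=()):
        self.values = sorted(set(values))
        self.gaps = {}  # number of missing values -> set of first missing numbers
        for lo, hi in zip(self.values, self.values[1:]):
            self._add_gap(lo, hi)

    def _add_gap(self, lo, hi):
        size = hi - lo - 1
        if size > 0:
            self.gaps.setdefault(size, set()).add(lo + 1)

    def _remove_gap(self, lo, hi):
        size = hi - lo - 1
        if size > 0:
            starts = self.gaps[size]
            starts.discard(lo + 1)
            if not starts:
                del self.gaps[size]

    def add(self, value):
        """Insert value and update only the gaps next to it."""
        pos = bisect.bisect_left(self.values, value)
        if pos < len(self.values) and self.values[pos] == value:
            return  # already present, nothing changes
        lo = self.values[pos - 1] if pos > 0 else None
        hi = self.values[pos] if pos < len(self.values) else None
        if lo is not None and hi is not None:
            self._remove_gap(lo, hi)  # the old gap is split by the new value
        self.values.insert(pos, value)
        if lo is not None:
            self._add_gap(lo, value)
        if hi is not None:
            self._add_gap(value, hi)

    def starts_of(self, size):
        """Return the sorted starts of all gaps of exactly `size`."""
        return sorted(self.gaps.get(size, ()))


tracker = GapTracker([1, 2, 3, 7, 8, 9, 10])
print(tracker.starts_of(3))
tracker.add(5)
print(tracker.starts_of(1))
```

Adding 5 splits the size-3 gap into two size-1 gaps, and only the two neighbouring entries of the index are touched.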

score 1 · Answer 3 · answered Dec 07 '10 at 10:18

If it's efficiency you're after, I'd do something along the following lines (where x is the list of sequence numbers):

for i in range(1, len(x)):
  if x[i] - x[i - 1] == length + 1:
    print list(range(x[i - 1] + 1, x[i]))

edited Dec 07 '10 at 10:58

answered Dec 07 '10 at 10:18

NPE


score 1 · Answer 4 · answered Dec 07 '10 at 10:28

Pretty much what aix did ... but getting only the gaps of the desired length:

def findGaps(mylist, gap_length, start_idx=0):
    gap_starts = []
    for idx in range(start_idx, len(mylist) - 1):
        if mylist[idx+1] - mylist[idx] == gap_length + 1:
            gap_starts.append(mylist[idx] + 1)

    return gap_starts

EDIT: Adjusted to the OP's wishes.
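In the comments on the question, `start` is described as a number to search from rather than a list index. If that is the intent, one possible Python 3 sketch converts the number to an index with `bisect` first. The name `first_gap_from` is hypothetical, the list is assumed to be sorted, and the search is assumed to begin at the first element that is greater than or equal to `start`:

```python
import bisect


def first_gap_from(numbers, gap_length, start=0):
    """Return the first missing number of the first gap of exactly
    gap_length found at or after the value `start`, or -1 if none."""
    idx = bisect.bisect_left(numbers, start)  # first element >= start
    for lo, hi in zip(numbers[idx:], numbers[idx + 1:]):
        if hi - lo == gap_length + 1:
            return lo + 1
    return -1


nums = [1, 2, 3, 7, 8, 9, 10, 14]
print(first_gap_from(nums, 3))
print(first_gap_from(nums, 3, start=5))
```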

edited Dec 07 '10 at 10:34

answered Dec 07 '10 at 10:28

Johannes Charra


score 1 · Answer 5 · answered Dec 07 '10 at 10:51

These provide a single walk of your input list.

List of gap values for given length:

from itertools import tee, izip
def gapsofsize(iterable, length):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x == length+1 for p in xrange(x+1,y) )

print list(gapsofsize([1,2,5,8,9], 2))

[3, 4, 6, 7]

All gap values:

def gaps(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( p for x, y in izip(a, b) if y-x > 1 for p in xrange(x+1,y) )

print list(gaps([1,2,4,5,8,9,14]))

[3, 6, 7, 10, 11, 12, 13]

List of gaps as (start, size) pairs:

def gapsizes(iterable):
    a, b = tee(iterable)
    next(b, None)
    return ( (x+1, y-x-1) for x, y in izip(a, b) if y-x > 1 )

print list(gapsizes([1,2,4,5,8,9,14]))

[(3, 1), (6, 2), (10, 4)]

Note that these are generators and consume very little memory. I would love to know how these perform on your test dataset.
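The snippets above target Python 2. On Python 3, `itertools.izip` and `xrange` no longer exist, and the built-in `zip` and `range` are lazy already. A sketch of the same generators for Python 3, with the shared pairing step pulled into a helper (the name `pairs` is not from the original answer):

```python
from itertools import tee


def pairs(iterable):
    """Yield (x, y) for each pair of adjacent items, in a single pass."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one item
    return zip(a, b)


def gapsofsize(iterable, length):
    """Yield the missing values of every gap of exactly `length`."""
    return (p for x, y in pairs(iterable) if y - x == length + 1
            for p in range(x + 1, y))


def gapsizes(iterable):
    """Yield (first missing value, gap size) for every gap."""
    return ((x + 1, y - x - 1) for x, y in pairs(iterable) if y - x > 1)


print(list(gapsofsize([1, 2, 5, 8, 9], 2)))
print(list(gapsizes([1, 2, 4, 5, 8, 9, 14])))
```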
