7

I have got a sequence of strings - 0000001, 0000002, 0000003.... upto 2 million. They are not contiguous. Meaning there are gaps. Say after 0000003 the next string might be 0000006. I need to find out all these gaps. In the above case (0000004, 0000005).

This is what I have done so far -

gaps  = list()
total = len(curr_ids)

for i in range(total):
    tmp_id = '%s' %(str(i).zfill(7))
    if tmp_id in curr_ids:
        continue
    else:
        gaps.append(tmp_id)
return gaps

But as you would have guessed, this is slow since I am using list. If I use a dict, to pre-populate curr_ids it'll be faster. But what's the complexity to populating a hash-table? What's the fastest way to do this.

Srikar Appalaraju
  • 71,928
  • 54
  • 216
  • 264

4 Answers4

11

You could sort the list of ids and then step through it once only:

def find_gaps(ids):
    """Generate the gaps in the list of ids."""
    j = 1
    for id_i in sorted(ids):
        while True:
            id_j = '%07d' % j
            j += 1
            if id_j >= id_i:
                break
            yield id_j

>>> list(find_gaps(["0000001", "0000003", "0000006"]))
['0000002', '0000004', '0000005']

If the input list is already in order, then you can avoid the sorted (though it does little harm: Python's adaptive mergesort is O(n) if the list is already sorted).

Gareth Rees
  • 64,967
  • 9
  • 133
  • 163
3

For storing sequence of 2 millions ints you can use bitarray. Here each bit means one integer (the integer of that index in bitarray). Example code:

gaps = []
# bitarray is 0 based
a = bitarray.bitarray(total + 1)
a.setall(False)
for sid in curr_ids:
    a[int(sid)] = True
for i in range(1, total):
    if not a[i]:
        gaps.append('%07d' %(i))
return gaps
Michał Niklas
  • 53,067
  • 18
  • 70
  • 114
1
seq = *the sequence of strings*
n = 2000000

gaps = set(str(i).zfill(7) for i in range(1,n+1)) - set(seq)
jairajs89
  • 4,495
  • 3
  • 18
  • 14
0

I would suggest take it int rather than string for processing and then making it a string again in output

j=0
n=2000000
#create a list of int number from your string
foo = [i for i in range(n)]
#creating gaps
foo.remove(1)
foo.remove(50)
while j<n:
    for i in foo:
        if i>j:
            print '%07d'%j
            j+=1
        j+=1
Rafi
  • 805
  • 6
  • 12