Efficient way to find missing elements in an integer sequence

Question

Suppose we have two items missing in a sequence of consecutive integers and the missing elements lie between the first and last elements. I did write a code that does accomplish the task. However, I wanted to make it efficient using less loops if possible. Any help will be appreciated. Also what about the condition when we have to find more missing items (say close to n/4) instead of 2. I think then my code should be efficient right because I am breaking out from the loop earlier?

def missing_elements(L,start,end,missing_num):
    complete_list = range(start,end+1)
    count = 0
    input_index = 0
    for item  in  complete_list:
        if item != L[input_index]:
            print item
            count += 1
        else :
            input_index += 1
        if count > missing_num:
            break



def main():
    L = [10,11,13,14,15,16,17,18,20]
    start = 10
    end = 20
    missing_elements(L,start,end,2)



if __name__ == "__main__":
    main()

How exactly do I use binary search in this case if I don't know the missing value I am looking for? — vkaul11, Jun 07 '13 at 00:04
not exactly binary search, but you can infer that the part of the list between bottom and index is completely consecutive if L[index] == L[bottom] + (index - bottom). This combined with splitting the list into two should give sublinear solution. — Lie Ryan, Jun 07 '13 at 00:27
@vkaul11: I've posted a complete solution expanding on my comment above. — Lie Ryan, Jun 07 '13 at 01:37

Martijn Pieters · Answer 1 · 2017-05-05T13:40:43.140

67

If the input sequence is sorted, you could use sets here. Take the start and end values from the input list:

def missing_elements(L):
    start, end = L[0], L[-1]
    return sorted(set(range(start, end + 1)).difference(L))

This assumes Python 3; for Python 2, use xrange() to avoid building a list first.

The sorted() call is optional; without it a set() is returned of the missing values, with it you get a sorted list.

Demo:

>>> L = [10,11,13,14,15,16,17,18,20]
>>> missing_elements(L)
[12, 19]

Another approach is by detecting gaps between subsequent numbers; using an older itertools library sliding window recipe:

from itertools import islice, chain

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)

This is a pure O(n) operation, and if you know the number of missing items, you can make sure it only produces those and then stops:

def missing_elements(L, count):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(islice(missing, 0, count))

This will handle larger gaps too; if you are missing 2 items at 11 and 12, it'll still work:

>>> missing_elements([10, 13, 14, 15], 2)
[11, 12]

and the above sample only had to iterate over [10, 13] to figure this out.

edited May 05 '17 at 13:40

answered Jun 06 '13 at 23:49

Martijn Pieters

1,048,767
296
4,058
3,343

2

I'm quite sure the input list is already sorted, as he named that a sequence. From what I got from the OP's question, he's trying to make it less than `O(n)`. – Rubens Jun 06 '13 at 23:53
1

@Rubens: The output of the difference between sets is a set, which does not have a defined order. Returning `sorted()` on the set returns a sorted list, however. – Martijn Pieters Jun 06 '13 at 23:54
1

@Rubens: and I am skeptical that the OP is aware enough of big-Oh notation to ask for better-than-O(n) unless I see that explicitly stated. – Martijn Pieters Jun 06 '13 at 23:57
Yes Rubens, I am trying to get more efficient algorithm. Martijn, so my method in terms of complexity is as efficient as you can get or you think there is some trick that help make it more efficient? Does not your method have an additional penalty if I had to find 5 missing items instead of 2 because I can sometime break from the loop earlier and I don't need sorting in my code. I know if we have one number missing difference in sums can help. – vkaul11 Jun 07 '13 at 00:00
@vkaul11: Your statements are contradicting a little; you know beforehand how many items are missing? So if 4 are missing, you only have to find those 4 and then you are done? – Martijn Pieters Jun 07 '13 at 00:05
Martijn, I am aware of the big-Oh notation and I wanted to know if the code can be faster/more efficient than the code I wrote. Obviously it can be more compact (like squiguy has written) but I wanted something faster or equally fast and compact. – vkaul11 Jun 07 '13 at 00:07
@Martijn, yes I know before hand how many items are missing, that is right. How expensive is the operation of converting the list to a set? The total time in your algorithm is the operation of converting the list to a set (O(n)?), taking a difference of two sets (which I guess is O(n)?) and then doing the final sort O(logn). Is that right? What exactly are we gaining by using a set here compared to my code? – vkaul11 Jun 07 '13 at 00:14
1

@vkaul11: Updated to give you a O(n) algorithm, with an option to bail out early if the number of missing elements is known up front. – Martijn Pieters Jun 07 '13 at 00:16
@vkaul11: Turning a sequence into a set is a O(n) operation. – Martijn Pieters Jun 07 '13 at 00:18
@Martijin: Thanks I changed my code also to take number of missing values? Isn't my code as efficient as yours? – vkaul11 Jun 07 '13 at 00:22
@vkaul11: Both are O(n), provided you are using Python 3, but I think mine beats yours on constant costs. We'd have to use the `timeit` module to measure actual performance here, with a spread of input scenarios. – Martijn Pieters Jun 07 '13 at 00:27
You mean the algorithm u gave that does not use sets? In Python2.7 which one will be better, mine or yours? I learned a lot from this discussion and thanks Martijin. – vkaul11 Jun 07 '13 at 00:45
@vkaul11: In 2.7 `range()` materializes a full list, so that is a loop right there. Use `xrange()` to just create a range object (like python 3 `range()` would do) instead. – Martijn Pieters Jun 07 '13 at 06:47

Lie Ryan · Accepted Answer · 2013-06-07T11:04:22.277

12

Assuming that L is a list of integers with no duplicates, you can infer that the part of the list between start and index is completely consecutive if and only if L[index] == L[start] + (index - start) and similarly with index and end is completely consecutive if and only if L[index] == L[end] - (end - index). This combined with splitting the list into two recursively gives a sublinear solution.

# python 3.3 and up, in older versions, replace "yield from" with yield loop

def missing_elements(L, start, end):
    if end - start <= 1: 
        if L[end] - L[start] > 1:
            yield from range(L[start] + 1, L[end])
        return

    index = start + (end - start) // 2

    # is the lower half consecutive?
    consecutive_low =  L[index] == L[start] + (index - start)
    if not consecutive_low:
        yield from missing_elements(L, start, index)

    # is the upper part consecutive?
    consecutive_high =  L[index] == L[end] - (end - index)
    if not consecutive_high:
        yield from missing_elements(L, index, end)

def main():
    L = [10,11,13,14,15,16,17,18,20]
    print(list(missing_elements(L,0,len(L)-1)))
    L = range(10, 21)
    print(list(missing_elements(L,0,len(L)-1)))

main()

edited Jun 07 '13 at 11:04

answered Jun 07 '13 at 01:37

Lie Ryan

62,238
13
100
144

Thanks @Lie Ryan. I will try to look at the yield from function that I am not familiar with. – vkaul11 Jun 07 '13 at 02:00
3

I love the smell of recursion in the morning – Edgar H Mar 06 '20 at 14:29
How would this be faster than just checking each item in a simple for loop? You still have to check every adjacent item, you're just doing it with a bunch of recursion thrown in there, which would slow it down further. Unless I'm missing something. – rcronk Mar 03 '21 at 00:17
@rcronk "still have to check every adjacent item" Nope, that's incorrect, with the recursive solution, only ranges that have holes need to be checked, that's why this is sublinear. If there's `k` holes, my recursive solution will find that hole `O(k*log(n)) where k <= n`, if `k` is 1 or if all the holes are grouped together in one chunk, then this will find the hole in `O(log(n))`. Or to simplify, it's `O(log(n))`. The best case scenario is when there are no holes, where it will verify that in `O(1)` because it only need to check exactly two numbers. – Lie Ryan Mar 03 '21 at 00:52
@rcronk Now, if you're trying to implement this with in real life machines, some complications do arise, real life machines like linear loops as they're easy for the branch predictor, memory prefetcher likes memory locality, and real life machines also can do parallel processing. So a real life implementation will likely use a hybrid approach where it starts with this algorithm, queue up ranges for parallel processing pool, and then switch to linear search when the ranges have been divided small enough. But in the pure algorithmic world, you don't need to do any of that for optimal solution. – Lie Ryan Mar 03 '21 at 00:57
@rcronk: in real life solution, you'll also have to consider that Python loops and function calls are gosh darn expensive. So an asymptotically worst solution like Martijn Pieters' solution might actually run much faster because it uses C loops. My solution is what your algorithms professor expects in algorithm classes, Martijn's solution is what you should actually implement if your company have this problem to solve, and they need the report yesterday. – Lie Ryan Mar 03 '21 at 01:12
@LieRyan - But how do you know a range has holes without checking it first? If there's a list of 10 items with no holes in it, which two items do you check to know there are no holes in it in O(1) time? I'm definitely missing something here. – rcronk Mar 09 '21 at 19:07
@rcronk: If the list is sorted and contains no duplicates, that means if `L[start] == L[end] - (end - start)`, then it must contain all consecutive numbers. You only need to check `L[start]` and `L[end]`. If the list contains any holes, then `end - start > L[end] - L[start]`. – Lie Ryan Mar 15 '21 at 22:41
@LieRyan - I apologize. I completely missed the part where the list was sorted. That makes complete sense. Thanks. – rcronk Mar 17 '21 at 04:24

score 3 · Answer 3 · edited Jun 06 '13 at 23:52

3

missingItems = [x for x in complete_list if not x in L]

edited Jun 06 '13 at 23:52

squiguy

32,370
6
56
63

answered Jun 06 '13 at 23:51

user2461709

39
1

2

This is not that efficient unless `L` is turned into a set. Potentially, each `in` test is O(n) complexity. – Martijn Pieters Jun 06 '13 at 23:52
1

Would this accomplish that: `[x for x in complete_list if not x in set(L)]` ? – Jason Sperske Jun 06 '13 at 23:53
Yes though the code is shorter it is O(n^2) while my code though longer is O(n). – vkaul11 Jun 06 '13 at 23:56
It would be better to make L a set in a separate line of code. – user2461709 Jun 06 '13 at 23:56
1

@JasonSperske: the `set(L)` expression would be executed for each iteration in the loop, so no. `set_L = set(L)` first, then use `if x not in set_L` in the list comprehension. – Martijn Pieters Jun 06 '13 at 23:58

score 3 · Answer 4 · edited Dec 05 '19 at 05:11

3


 a=[1,2,3,7,5,11,20]
 b=[]
 def miss(a,b):
     for x in range (a[0],a[-1]):
        if x not in a:
            b.append(x)
     return b
 print (miss(a,b))

ANS:[4, 6, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19]

works for sorted,unsorted , with duplicates too

edited Dec 05 '19 at 05:11

Avinash Dalvi

8,551
7
27
53

answered Dec 05 '19 at 05:03

joseph jenistan

31
3

The range to be `range(a[0],a[-1]+1)` if you want it to include the last element of your initial list of numbers. But overall I like this solution a lot, seems simple compared to some of the other solutions. – Brad Root Nov 16 '20 at 22:38

aldeb · Answer 5 · 2013-06-07T00:03:57.130

2

Using collections.Counter:

from collections import Counter

dic = Counter([10, 11, 13, 14, 15, 16, 17, 18, 20])
print([i for i in range(10, 20) if dic[i] == 0])

Output:

[12, 19]

edited Jun 07 '13 at 00:03

answered Jun 06 '13 at 23:57

aldeb

6,588
5
25
48

Argentum · Answer 6 · 2020-09-15T18:39:31.283

arr = [1, 2, 5, 6, 10, 12]
diff = []

"""zip will return array of tuples (1, 2) (2, 5) (5, 6) (6, 10) (10, 12) """
for a, b in zip(arr , arr[1:]):
    if a + 1 != b:
        diff.extend(range(a+1, b))

print(diff)

[3, 4, 7, 8, 9, 11]

If the list is sorted we can lookup for any gap. Then generate a range object between current (+1) and next value (not inclusive) and extend it to the list of differences.

0x90 · Answer 7 · 2013-06-07T19:15:28.770

Using scipy lib:

import math
from scipy.optimize import fsolve

def mullist(a):
    mul = 1
    for i in a:
        mul = mul*i
    return mul

a = [1,2,3,4,5,6,9,10]
s = sum(a)
so = sum(range(1,11))
mulo = mullist(range(1,11))
mul = mullist(a)
over = mulo/mul
delta = so -s
# y = so - s -x
# xy = mulo/mul
def func(x):
    return (so -s -x)*x-over

print int(round(fsolve(func, 0))), int(round(delta - fsolve(func, 0)))

Timing it:

$ python -mtimeit -s "$(cat with_scipy.py)" 

7 8

100000000 loops, best of 3: 0.0181 usec per loop

Other option is:

>>> from sets import Set
>>> a = Set(range(1,11))
>>> b = Set([1,2,3,4,5,6,9,10])
>>> a-b
Set([8, 7])

And the timing is:

Set([8, 7])
100000000 loops, best of 3: 0.0178 usec per loop

acjay · Answer 8 · 2013-06-06T23:58:50.090

0

Here's a one-liner:

In [10]: l = [10,11,13,14,15,16,17,18,20]

In [11]: [i for i, (n1, n2) in enumerate(zip(l[:-1], l[1:])) if n1 + 1 != n2]
Out[11]: [1, 7]

I use the list, slicing to offset the copies by one, and use enumerate to get the indices of the missing item.

For long lists, this isn't great because it's not O(log(n)), but I think it should be pretty efficient versus using a set for small inputs. izip from itertools would probably make it quicker still.

edited Jun 06 '13 at 23:58

answered Jun 06 '13 at 23:50

acjay

34,571
6
57
100

Is that not readable? It's all pretty basic builtin Python. In code I'd check in, I'd probably choose better identifier names, but then again, maybe not if it's all self-contained in one line. – acjay Jun 06 '13 at 23:54
I would just do it over a few lines, but that's just me :). – squiguy Jun 06 '13 at 23:54
I agree, if I'm spilling over 80 chars, I usually break on `in` and `if` in my list comprehensions. – acjay Jun 06 '13 at 23:56
this may produce an unexpected result if the missing numbers are consecutive. A second pass is needed to check for that. PS: using basic features of the language isn't the same as good readability. – Lie Ryan Jun 07 '13 at 00:04
@LieRyan Good call on the consecutive numbers. I really don't think the way I wrote it is so esoteric. If you understand enumerate and zip, it pretty much reads from left-to-right. Your solution (while awesome, sublinear, and more correct) is not really more readable in my opinion – acjay Jun 07 '13 at 02:33
@acjohnson: to clarify, I don't find your code particularly unreadable; I'm simply correcting the incorrect assertion that "it's all pretty basic built-in Python" means that it's readable. – Lie Ryan Jun 07 '13 at 17:15

score 0 · Answer 9 · answered Jun 06 '13 at 23:54

0

My take was to use no loops and set operations:

def find_missing(in_list):
    complete_set = set(range(in_list[0], in_list[-1] + 1))
    return complete_set - set(in_list)

def main():
    sample = [10, 11, 13, 14, 15, 16, 17, 18, 20]
    print find_missing(sample)

if __name__ == "__main__":
    main()

# => set([19, 12])

answered Jun 06 '13 at 23:54

mVChr

49,587
11
107
104

1

To build the sets, and to calculate the difference, Python still has to employ loops. These are just hidden in C code. They'll be faster than Python loops, and the set difference is an efficient operation, but the loops are still there. – Martijn Pieters Jun 06 '13 at 23:56
@MartijnPieters True enough, and I see you beat me to it anyway. Thanks for elaborating. – mVChr Jun 06 '13 at 23:58

score 0 · Answer 10 · answered Jun 06 '13 at 23:58

0

Simply walk the list and look for non-consecutive numbers:

prev = L[0]
for this in L[1:]:
    if this > prev+1:
        for item in range(prev+1, this):    # this handles gaps of 1 or more
            print item
    prev = this

answered Jun 06 '13 at 23:58

Brent Washburne

12,904
4
60
82

That's how I'd do it in a language without some of Python's syntactic power, but that would be a fairly plodding way to do it in Python. – acjay Jun 07 '13 at 00:01
Agreed. I like your one-liner better. – Brent Washburne Jun 07 '13 at 00:05
OTOH, this code is about the easiest to read, and it doesn't require any other libraries, sets, or syntactic tools. It would be trivial to add a counter to break out of the loop. – Brent Washburne Jun 07 '13 at 06:22

Mike Müller · Answer 11 · 2013-06-07T00:26:58.673

0

We found a missing value if the difference between two consecutive numbers is greater than 1:

>>> L = [10,11,13,14,15,16,17,18,20]
>>> [x + 1 for x, y in zip(L[:-1], L[1:]) if y - x > 1]
[12, 19]

Note: Python 3. In Python 2 use itertools.izip.

Improved version for more than one value missing in a row:

>>> import itertools as it
>>> L = [10,11,14,15,16,17,18,20] # 12, 13 and 19 missing
>>> [x + diff for x, y in zip(it.islice(L, None, len(L) - 1),
                              it.islice(L, 1, None)) 
     for diff in range(1, y - x) if diff]
[12, 13, 19]

edited Jun 07 '13 at 00:26

answered Jun 07 '13 at 00:06

Mike Müller

82,630
20
166
161

This does create two copies of the input list; those copies require loops too. You can do this with iterators (see the `itertools` module documentation for an example) where you don't need to create copies. – Martijn Pieters Jun 07 '13 at 00:08
Python 3 doesn't make slices into iterators; they are still copies. – Martijn Pieters Jun 07 '13 at 00:10
Ok. I meant for the `zip`. – Mike Müller Jun 07 '13 at 00:11
And this'll miss larger gaps. – Martijn Pieters Jun 07 '13 at 00:12
Doesn't consecutive mean a step size of 1? – Mike Müller Jun 07 '13 at 00:14
The numbers are *meant* to be consecutive; but what if 11 and 12 are missing? The missing numbers could be consecutive too. – Martijn Pieters Jun 07 '13 at 00:17
Now with iterator slices and more than one missing value in a row. – Mike Müller Jun 07 '13 at 00:29

score 0 · Answer 12 · answered Jun 07 '13 at 00:09

0

>>> l = [10,11,13,14,15,16,17,18,20]
>>> [l[i]+1 for i, j in enumerate(l) if (l+[0])[i+1] - l[i] > 1]
[12, 19]

answered Jun 07 '13 at 00:09

dansalmo

11,506
5
58
53

score 0 · Answer 13 · answered Mar 16 '16 at 10:35

def missing_elements(inlist):
    if len(inlist) <= 1:
        return []
    else:
        if inlist[1]-inlist[0] > 1:
            return [inlist[0]+1] + missing_elements([inlist[0]+1] + inlist[1:])
        else:
            return missing_elements(inlist[1:])

Victor Stanescu · Answer 14 · 2017-08-31T09:50:46.290

0

First we should sort the list and then we check for each element, except the last one, if the next value is in the list. Be carefull not to have duplicates in the list!

l.sort()

[l[i]+1 for i in range(len(l)-1) if l[i]+1 not in l]

edited Aug 31 '17 at 09:50

answered Aug 30 '17 at 11:39

Victor Stanescu

451
3
9

2

Code-only answers are often unhelpful in pointing to *why* the issue happened. You should include an explanation why it solves the issue. Please read [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) – FluffyKitten Aug 30 '17 at 12:11

score 0 · Answer 15 · answered Aug 19 '19 at 02:18

With this code you can find any missing values in a sequence, except the last number. It in only required to input your data into excel file with column name "numbers".

import pandas as pd
import numpy as np

data = pd.read_excel("numbers.xlsx")

data_sort=data.sort_values('numbers',ascending=True)
index=list(range(len(data_sort)))
data_sort['index']=index
data_sort['index']=data_sort['index']+1
missing=[]

for i in range (len(data_sort)-1):
    if data_sort['numbers'].iloc[i+1]-data_sort['numbers'].iloc[i]>1:
        gap=data_sort['numbers'].iloc[i+1]-data_sort['numbers'].iloc[i]
        numerator=1
        for j in range (1,gap):          
            mis_value=data_sort['numbers'].iloc[i+1]-numerator
            missing.append(mis_value)
            numerator=numerator+1
print(np.sort(missing))

The question is asking for a generic algorithm. Your answer is too implementation dependent. — Udo E., Aug 19 '19 at 02:41

arp · Answer 16 · 2020-01-21T20:08:47.483

I stumbled on this looking for a different kind of efficiency -- given a list of unique serial numbers, possibly very sparse, yield the next available serial number, without creating the entire set in memory. (Think of an inventory where items come and go frequently, but some are long-lived.)

def get_serial(string_ids, longtail=False):
  int_list = map(int, string_ids)
  int_list.sort()
  n = len(int_list)
  for i in range(0, n-1):
    nextserial = int_list[i]+1
    while nextserial < int_list[i+1]:
      yield nextserial
      nextserial+=1
  while longtail:
    nextserial+=1
    yield nextserial
[...]
def main():
  [...]
  serialgenerator = get_serial(list1, longtail=True)
  while somecondition:
    newserial = next(serialgenerator)

(Input is a list of string representations of integers, yield is an integer, so not completely generic code. longtail provides extrapolation if we run out of range.)

There's also an answer to a similar question which suggests using a bitarray for efficiently handling a large sequence of integers.

Some versions of my code used functions from itertools but I ended up abandoning that approach.

score 0 · Answer 17 · answered Sep 29 '21 at 07:51

A bit of mathematics and we get a simple solution. The below solution works for integers from m to n. Works for both sorted and unsorted postive and negative numbers.

#numbers = [-1,-2,0,1,2,3,5]
numbers = [-2,0,1,2,5,-1,3]

sum_of_nums =  0
max = numbers[0]
min = numbers[0]
for i in numbers:
    if i > max:
        max = i
    if i < min:
        min = i
    sum_of_nums += i

# Total : sum of numbers from m to n    
total = ((max - min + 1) * (max + min)) / 2

# Subtract total with sum of numbers which will give the missing value
print total - sum_of_nums

Efficient way to find missing elements in an integer sequence

17 Answers17

Linked