Assuming the numbers in the file are already sorted, this is an improved version of @ilmiacs's solution.
def find_missing(f, line_number_ub):
    """Return a list of the numbers in [1, line_number_ub) missing from f."""
    missing = []
    next_expected = 1
    for i in map(int, f):
        # The logic is correct without the if, but adding it can greatly boost
        # performance, especially when the percentage of missing numbers is small
        if next_expected < i:
            missing += range(next_expected, i)
        next_expected = i + 1
    # Account for numbers missing after the last line of the file
    missing += range(next_expected, line_number_ub)
    return missing
with open(path, 'r') as f:
    print(*find_missing(f, 10**12), sep='\n')
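Since find_missing only needs an iterable of lines, you can sanity-check it on an in-memory list of numeric strings (the sample values below are made up):

lines = ["1", "2", "4", "7"]
print(find_missing(lines, 10))  # [3, 5, 6, 8, 9]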
If a generator is preferred over a list, you can do
def find_missing_gen(f, line_number_ub):
    next_expected = 1
    for i in map(int, f):
        # Same early-exit check as above, skipping the empty-range case
        if next_expected < i:
            yield from range(next_expected, i)
        next_expected = i + 1
    yield from range(next_expected, line_number_ub)
with open(path, 'r') as f:
    print(*find_missing_gen(f, 10**12), sep='\n')
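One advantage of the generator version is laziness: for example, you can peek at just the first few missing numbers without materializing the full result (the use of islice here is only for illustration):

from itertools import islice

with open(path, 'r') as f:
    # Stops reading as soon as 10 missing numbers have been produced
    print(*islice(find_missing_gen(f, 10**12), 10), sep='\n')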
And here are some performance measurements using a list of strings from 1 to 9999 with 100 missing values (randomly selected); a rough reconstruction of the setup is sketched after the timings:
(find_missing) 2.35 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing w/o if) 4.67 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(@blhsing's solution) 3.54 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen) 2.35 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen w/o if) 4.42 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
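The exact benchmark code is not shown above; a rough reconstruction of the test data and timing (the variable names and the use of timeit are my assumptions) could look like this:

import random
import timeit

# Sorted numeric strings from 1 to 9999 with 100 randomly chosen values removed
removed = set(random.sample(range(1, 10000), 100))
data = [str(n) for n in range(1, 10000) if n not in removed]

# find_missing accepts any iterable of strings, so the list stands in for a file
print(timeit.timeit(lambda: find_missing(data, 10000), number=100))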
You may want to run some preliminary tests on your machine with a 1 GB file to estimate whether the performance on 100 GB files will meet your requirements. If not, you could consider further optimizations, such as reading the file in blocks or using a more advanced algorithm to find the missing numbers.
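For example, one possible (untested) way to read in blocks while still feeding lines to find_missing_gen is to split large chunks manually; the block size below is arbitrary and worth tuning:

def iter_lines_in_blocks(f, block_size=1 << 20):
    # Read the file in large chunks and split into lines manually; the trailing
    # partial line of each chunk is carried over to the next read
    leftover = ''
    while True:
        chunk = f.read(block_size)
        if not chunk:
            break
        lines = (leftover + chunk).split('\n')
        leftover = lines.pop()  # possibly incomplete last line
        yield from lines
    if leftover:
        yield leftover

with open(path, 'r') as f:
    print(*find_missing_gen(iter_lines_in_blocks(f), 10**12), sep='\n')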