I was reading about the problem of finding the missing number from a series of 4 billion 32 bit integers in Programming Pearls
, but could not quite get the solution.
Given a sequential file that contains at most 4 Billion 32-bit integers in random order, find a 32-bit integer not on the file.
Constraint: Few hundrends of bytes of main memory but ample use of extrernal scratch files on disk
The solution described is a process where we separate the numbers in ranges using the the upper bits (I.e. in the first pass we write those with leading 0
bit to one file and those with leading 1
bit to another.We keep going using the second bit etc.) and divide in half using as new search range the half containing less than half of the numbers in the range.
I googled and found a similar post in SO which does not quite answer my question which is how the exact number is found (I undestand how the binary search fits in separate the ranges).
The answer of @Damien_The_Unbeliever seems the most relevant but from my point of view I thought that following the process we would end up with 2 files: A file with 2 numbers in range and a file with 1 number.
By subtracting the (one) number in one file with the largest of the others we can get a missing number (no need of bitmask and I am not quite sure if we could actually apply a bitmask since the range is unknown at any point).
Am I wrong on this? Could someone help figure this out?