3

From the book Programming Pearls 2nd Edition, quoting the question A from Column 2, section 2.1

Given a sequential file that contains at most four billion integers in random order, find a 32-bit integer that isn't in the file (and there must be at least one missing - why ?). How would you solve this problem with ample quantities of main memory ? How would you solve it if you could use several external "scratch" files but only a few hundred bytes of main memory ?

The problem statement says "at most" four billion integers. So one valid input could be in the range 100 - 299 with one missing number. If this understanding of the problem statement is correct then the required inputs for this problem are the file containing the numbers and also the range of the numbers in the file, ie: i to n.

For this problem isn't the following O(n) solution more intuitive than the one given in the book (from Ed Reingold) ? Or am I missing something ?

Assume the given range is i...n

using the forumla (n * (n + 1) / 2) 
  x = the sum of numbers from 1 to i-1 
  y = the sum of numbers from 1 to n 
walk through the input and get a sum of the numbers (value z)

missing number = (y - x - z)
Manohar
  • 3,865
  • 11
  • 41
  • 56
  • 2
    You are missing something: (1) The numbers don't said to be *unique*, the summation approach assumes unique elements and only one missing in the range (2) You don't know the range of the numbers, in 32 bit integer can have much more then 4B elements. – amit Dec 28 '12 at 09:58
  • You are making a lot of assumptions, which going by the question statement seem wrong – asheeshr Dec 28 '12 at 10:01
  • The problem statement doesn't mean that there is only one missing number (or that there are no duplicates). Your algorithm is solving a different problem where there is exactly one missing number from a range. – Michael Burr Dec 28 '12 at 10:02
  • There are slightly more than 4 billion integers that would fit in 32 bits. So we have an upper range on the number of possible values. With plenty of memory, you could use a bitmap to track the numbers you have found. With just a small amount of memory, you will have to work harder. :-) – Bo Persson Dec 28 '12 at 10:04
  • @MichaelBurr, but it says `find a 32-bit integer`. A 32-bit integer === One 32-bit integer? – Zolbayar Dec 23 '21 at 03:18

1 Answers1

4

You are missing something:

  1. The numbers don't said to be unique, the summation approach assumes unique elements and only one missing in the range
  2. You don't know the range of the numbers, in 32 bit integer can have much more then 4B elements (to be exact, there are 294967296 more numbers that can be represented by 32 bits then 4B)

Look at the simplified counter example with range = [1,5], array = [5,5,5,4,1].
In this case, you will get x=1, y = 15, z = 20.
However, 20-15-1 = 4, and it is not missing.

You can use radix sort, which runs in O(n) in this case (because 32 bits is constant), and then scan the sorted array to find the first missing element.

EDIT: A more efficient way to do it is yet with a variation of radix sort and selection algorithm:

  1. Let your current number of bits be k. Look at the first bit, and partition the array into two parts, the first with this bit unset, and the second with this bit set.
  2. In at least one of the parts - there must be less then (2^k) / 2 = 2^(k-1) elements.
  3. Reduce the problem to this sub array only and use k' = k-1 (reduce the current number of bits) and return to 1.

Keep doing it until you exhausted your bits, and you will get a number that is not in the original list.

Note that assuming the list is random enough, the complexity of the algorithm is O(n) - for any number of bits (and not O(n*k))

amit
  • 175,853
  • 27
  • 231
  • 333
  • Agree with you that the uniqueness assumption is not stated in the question. But I think the range is still required because the question only says at most 4B integers. If you think the range is not required can you explain how the solution (in the section towards the end of the book) given in the book will work ? – Manohar Dec 28 '12 at 10:07
  • @Santhosh The range is given by the fact that they are 32-bit integers; 0 to 2^32-1 resp. -2^31 to 2^31-1, depending on whether unsigned or signed. – Daniel Fischer Dec 28 '12 at 10:09
  • @Daniel for a 32 bit integer the range is known, of course. But the input file could contain a small subset. The range I was referring to is the min, mix for a given input instance. – Manohar Dec 28 '12 at 10:12
  • @Santhosh: But in your input instance, it is very possible that the max is INT_MAX and the lowest is INT_MIN, no one guaranteed the elements are sequential. – amit Dec 28 '12 at 10:14
  • @Santhosh And that you don't know, and need not know. That's one of the points. – Daniel Fischer Dec 28 '12 at 10:14
  • If you guys have access to the book would you mind explaining how the solution given in the book will work ? I am not sure I got it right. – Manohar Dec 28 '12 at 10:17
  • @Santhosh: You might want to post it as a new question, because it is. Also - have a look at my suggested solution (In edit) that solved the problem in `O(n)` for any number of bits. – amit Dec 28 '12 at 10:19
  • http://stackoverflow.com/questions/5011589/programming-pearls-binary-search-help has an answer than explains the solution given in the book.. but I think in the presence of duplicate numbers the binary search based approach may not work.. – Manohar Dec 28 '12 at 10:49
  • @Santhosh Duplicates don't interfere at all with the binary search explained there. At each point, you have a list of numbers with given prefix that is smaller than the number of integers with that prefix. If there are duplicates among them, that only means that yet more of the integers with that prefix are missing. – Daniel Fischer Dec 28 '12 at 11:12
  • @amit A follow up question: 1)In point 1 when you say look at the first bit, do you mean the MSB or LSB? – Geek Aug 28 '13 at 06:21