Finding missing number using binary search

Question

I am reading book on programming pearls.

Question: Given a sequential file that contains at most four billion 32 bit integers in random order, find a 32-bit integer that isn't in the file (and there must be at least one missing). This problem has to be solved if we have a few hundred bytes of main memory and several sequential files.

Solution: To set this up as a binary search we have to define a range, a representation for the elements within the range, and a probing method to determine which half of a range holds the missing integer. How do we do this?

We'll use as the range a sequence of integers known to contain atleast one missing element, and we'll represent the range by a file containing all the integers in it. The insight is that we can probe a range by counting the elements above and below its midpoint: either the upper or the lower range has atmost half elements in the total range. Because the total range has a missing element, the smaller half must also have a mising element. These are most ingredients of a binary search algorithm for above problem.

Above text is copy right of Jon Bently from programming pearls book.

Some info is provided at following link

"Programming Pearls" binary search help

How do we search by passes using binary search and also not followed with the example given in above link? Please help me understand logic with just 5 integers rather than million integers to understand logic.

score 3 · Accepted Answer · edited May 23 '17 at 10:24

3

Why don't you re-read the answer in the post "Programming Pearls" binary search help. It explains the process on 5 integers as you ask.
The idea is that you parse each list and break it into 2 (this is where binary part comes from) separate lists based on the value in the first bit.

I.e. showing binary representation of actual numbers Original List "": 001, 010, 110, 000, 100, 011, 101 => (broken into)
(we remove the first bit and append it to the "name" of the new list)
To form each of the bellow lists we took values starting with [0 or 1] from the list above
List "0": 01, 10, 00, 11 (is formed from subset 001, 010, 000, 011 of List "" by removing the first bit and appending it to the "name" of the new list)
List "1": 10, 00, 01 (is formed from subset 110, 100, 101 of List "" by removing the first bit and appending it to the "name" of the new list)

Now take one of the resulting lists in turn and repeat the process:
List "0" becomes your original list and you break it into
List "0***0**" and
List "0***1**" (the bold numbers are again the 1 [remaining] bit of the numbers in the list being broken)

Carry on until you end up with the empty list(s).

EDIT
Process step by step:
List "": 001, 010, 110, 000, 100, 011, 101 =>
List "0": 01, 10, 00, 11 (from subset 001, 010, 000, 011 of the List "") =>
List "00": 1, 0 (from subset 01, 00 of the List "0") =>
List "000": 0 [final result] (from subset 0 of the List "00")
List "001": 1 [final result] (from subset 1 of the List "00")
List "01": 0, 1 (from subset 10, 11 of the List "0") =>
List "010": 0 [final result] (from subset 0 of the List "01")
List "011": 1 [final result] (from subset 1 of the List "01")
List "1": 10, 00, 01 (from subset 110, 100, 101 of the List "") =>
List "10": 0, 1 (from subset 00, 01 of the List "1") =>
List "100": 0 [final result] (from subset 0 of the List "10")
List "101": 1 [final result] (from subset 1 of the List "10")
List "11": 0 (from subset 10 of the List "1") =>
List "110": 0 [final result] (from subset 0 of the List "11")
List "111": absent [final result] (from subset EMPTY of the List "11")

The positive of this method is that it will allow you to find ANY number of missing numbers in the set - i.e. if more than one is missing.

P.S. AFAIR for 1 single missing number out of the complete range there is even more elegant solution of XOR all numbers.

edited May 23 '17 at 10:24

Community

1
1

answered Sep 07 '12 at 09:08

Germann Arlington

3,315
2
17
19

How you please eloborate ho we do this with XOR all numbers? – venkysmarty Sep 07 '12 at 09:22
1

The XOR all numbers method relies on the fact that the result of 1 XOR 0 = 0. Numbers that complement each other will result in 0 and only the number that does NOT have a complement will remain. Again based on 3 bit numbers (the principal works for ANY number of bits): Given a list [as above] 001, 010, 110, 000, 100, 011, 101 => 001 complements 110 (in 3 bits), so the result of `001 ^ 110 => 0`... – Germann Arlington Sep 07 '12 at 09:34
Thanks for explanation. As mentioned above why did you take list "0"? and how did you break it into List"0*0*" and List "0*1**"? Sorry I am not following logic above. Can you please eloborate? – venkysmarty Sep 07 '12 at 09:38
I updated my answer to make it clear(er) - unfortunately I am not very good at explaining in details something that seems clear to me. As to why I took List "0" - because the process is to break each of the lists (you start with 1 list, after 1st step you will have 2 lists, after 2nd step you will have 4 lists...) into 2 sub-lists. Eventually you should find (a number of) empty list(s). – Germann Arlington Sep 07 '12 at 09:51
1

@GermannArlington `1 XOR 0 = 0` maybe you meant `1 XOR 0 = 1` ? – Yno Sep 07 '12 at 11:20
@GermannArlington Why start the second pass with the list 0 first and not list 1 ? – Geek Sep 07 '12 at 11:58
@Yno I always thought that: (1 OR 0) == (0 OR 1) == 1 and XOR is an opposite of OR. I think that the absence of brackets and incorrect usage of assignment operator mislead you. What I actually meant was the result of (1 XOR 0) is 0 – Germann Arlington Sep 07 '12 at 13:18
1

@GermannArlington which is false. XOR stands for "exclusive or" which means `a XOR b` is true only if `a` is true or `b` is true, not both. – Yno Sep 07 '12 at 13:20
@Yno Oh, S**T. You are right, I don't know if I can live with myself after c***-up like that. Thanks for correction. – Germann Arlington Sep 07 '12 at 13:38
@Geek I did not try to optimise this algorithm, the idea was to show the concept. Is there any particular reason why I should start with List 1 instead of List 0? – Germann Arlington Sep 07 '12 at 14:00

Ari · Answer 2 · 2012-09-07T09:21:10.780

1

The idea is to solve easier problem:

Is the missing value in range [minVal, X] or (X, maxVal). If you know this, you can move X and check again.

For example, you have 3, 4, 1, 5 (2 is missing). You know that minVal = 1, maxVal = 5.

Range = [1, 5], X = 3, there should be 3 integers in range [1, 3] and 2 in range [4, 5]. There are only 2 in range [1, 3], so you are looking in range [1, 3]
Range = [1, 3], X = 2. There are only 1 value in range [1, 2], so you are looking in range [1, 2]
Range = [1, 2], X = 1. There are no values in range [2, 2] so it is your answer.

EDIT: Some pseudo-C++ code:

minVal = 1, maxVal = 5; //choose correct values
while(minVal < maxVal){
    int X = (minVal + maxVal) / 2
    int leftNumber = how much in range [minVal, X]
    int rightNumber = how much in range [X + 1, maxVal]
    if(leftNumber < (X - minVal + 1))maxVal = X
    else minVal = X + 1
}

edited Sep 07 '12 at 09:21

answered Sep 07 '12 at 09:08

Ari

3,101
2
27
49

It is `floor((begin + end) / 2)`, in C++ you can write simply `(begin + end) / 2`. – Ari Sep 07 '12 at 09:15
Thanks for explanation. In http://stackoverflow.com/questions/5011589/programming-pearls-binary-search-help . Here we have 000, 001, 110, 100,111. Mentioned that after first pass we have 000, 001, 110, 100,111. Then we see second bit. I am understanding how after first pass we got same values again. – venkysmarty Sep 07 '12 at 09:21
After first pass we know that missing number's first bit is 0. So Second iteration is checking if second bit is 0 or 1. There is no number with first bit 0 and second 1. All numbers with first bit 0 have second bit 0, so we got same values. – Ari Sep 07 '12 at 09:24
How Do we select that "leftNumber" and "rightNumber" fast ? guess each time it needs O(n) operations if you're counting them with a for loop :-? – SpiXel Sep 07 '12 at 09:24
@SpiXel Yes. It's because we need to solve problem in few hundred bytes of memory. So we can't load all data to memory. – Ari Sep 07 '12 at 09:27
@arri how do we know that after first iteration missing number first bit is 0? Sorry for lot of quesitons. – venkysmarty Sep 07 '12 at 09:40
@venkysmarty because there is less number with first bit 0 than with first bit 1. I don't think it is described clearer there. Look at my example code. – Ari Sep 07 '12 at 09:47

score 1 · Answer 3 · answered Sep 07 '12 at 11:11

Here's a simple C solution which should illustrate the technique. To abstract away any tedious file I/O details, I'm assuming the existence of the following three functions:

unsigned long next_number (void) reads a number from the file and returns it. When called again, the next number in the file is returned, and so on. Behavior when the end of file is encountered is undefined.
int numbers_left (void) returns a true value if there are more numbers available to be read using next_number(), false if the end of the file has been reached.
void return_to_start (void) rewinds the reading position to the start of the file, so that the next call to next_number() returns the first number in the file.

I'm also assuming that unsigned long is at least 32 bits wide, as required for conforming ANSI C implementations; modern C programmers may prefer to use uint32_t from stdint.h instead.

Given these assumptions, here's the solution:

unsigned long count_numbers_in_range (unsigned long min, unsigned long max) {
    unsigned long count = 0;

    return_to_start();

    while ( numbers_left() ) {
        unsigned long num = next_number();
        if ( num >= min && num <= max ) {
            count++;
        }
    }
    return count;
}

unsigned long find_missing_number (void) {
    unsigned long min = 0, max = 0xFFFFFFFF;

    while ( min < max ) {
        unsigned long midpoint = min + (max - min) / 2;
        unsigned long count = count_numbers_in_range( min, midpoint );

        if ( count < midpoint - min + 1 ) {
            max = midpoint;  // at least one missing number below midpoint
        } else {
            min = midpoint;  // no missing numbers below midpoint, must be above
        }
    }
    return min;
}

One detail to note is that min + (max - min) / 2 is the safe way to calculate the average of min and max; it won't produce bogus results due to overflowing intermediate values like the seemingly simpler (min + max) / 2 might.

Also, even though it would be tempting to solve this problem using recursion, I chose an iterative solution instead for two reasons: first, because it (arguably) shows more clearly what's actually being done, and second, because the task was to minimize memory use, which presumably includes the stack too.

Finally, it would be easy to optimize this code, e.g. by returning as soon as count equals zero, by counting the numbers in both halves of the range in one pass and choosing the one with more missing numbers, or even by extending the binary search to n-ary search for some n > 2 to reduce the number of passes. However, to keep the example code as simple as possible, I've left such optimizations unmade. If you like, you may want to, say, try modifying the code so that it requires at most eight passes over the file instead of the current 32. (Hint: use a 16-element array.)

Is this really a form of binary search? The input isn't sorted, so you have to go through all N elements to count the numbers in each range. — Bob Templ, Sep 19 '14 at 20:40
@BobTempl: It's a binary search in the space of 32-bit integers. We're not looking for the missing number *in* the file (which would be hopeless, since, by definition, *it's not there*); we're looking for it in the range of numbers from 0 to 0xFFFFFFFF. By counting the numbers in the intersection of a given subrange with the contents of the file, we can tell whether the missing number (or at least one of them, if several) is in that subrange. — Ilmari Karonen, Sep 19 '14 at 20:47

score 0 · Answer 4 · answered Sep 07 '12 at 11:35

Actually, if we have range of integers from a to b. Sample: [a..b]. And in this range we have b-a integers. It means, that only one is missing. And if only one is missing, we can calculate result using only single cycle. First we can calculate sum of all integers in range [a..b], which equals:

sum = (a + b) * (b - a + 1) / 2

Then we calcualate summ of all integers in our sequence:

long sum1 = 0;
for (int i = 0; i < b - a; i++)
sum1 += arr[i];

Then we can find missing element as difference of those two sums:

long result = sum1 - sum;

score 0 · Answer 5 · answered May 31 '14 at 23:26

when you've seen 2^31 zeros or ones in the ith digit place then your answer has a one or zero in the ith place. (Ex: 2^31 ones in 5th binary position means the answer has a zero in the 5th binary position.

First draft of c code:

uint32_t binaryHistogram[32], *list4BILLION, answer, placesChecked[32];
uint64_t limit = 4294967296;
uint32_t halfLimit = 4294967296/2;
int i, j, done

//General method to point to list since this detail is not important to the question.
list4BILLION = 0000000000h;


//Initialize array to zero. This array represents the number of 1s seen as you parse through the list
for(i=0;i<limit;i++)
{   
    binaryHistogram[i] = 0;
}

//Only sum up for first half of the 4 billion numbers
for(i=0;i<halfLimit;i++)
{
    for(j=0;j<32;j++)
    {
        binaryHistogram[j] += ((*list4BILLION) >> j);
    }
}

//Check each ith digit to see if all halfLimit values have been parsed
for(i=halfLimit;i<limit;i++)
{
    for(j=0;j<32;j++)
    {
        done = 1;   //Dont need to continue to the end if placesChecked are all 
        if(placesChecked[j] != 0) //Dont need to pass through the whole list
        {
            done = 0; //
            binaryHistogram[j] += ((*list4BILLION) >> j);
            if((binaryHistogram[j] > halfLimit)||(i - binaryHistogram[j] == halfLimit))
            {
                answer += (1 << j);
                placesChecked[j] = 1;
            }
        }
    }
}

Finding missing number using binary search

5 Answers5