
Let me excerpt this problem and solution from Jon Bentley's book Programming Pearls:

Given a file containing 4,300,000,000 integers, how can you find one that appears at least twice?

Solution:

Binary search finds an element that occurs at least twice by recursively searching the subinterval that contains more than half of the integers. My original solution did not guarantee that the number of integers is halved in each iteration, so the worst-case run time of its log2 n passes was proportional to n·log n. Jim Saxe reduced that to linear time by observing that the search can avoid carrying too many duplicates. When his search knows that a duplicate must be in a current range of m integers, it will only store m+1 integers on its current work tape; if more integers would have gone on the tape, his program discards them. Although his method frequently ignores input variables, its strategy is conservative enough to ensure that it finds at least one duplicate.

As I understand it, the author's original approach is to partition the numbers recursively and count the numbers in each partition (see the sketch after this list):

  1. Test the first bit of each 32-bit integer: if it's 0, put the number in one file; if it's 1, put it in the other.
  2. The file with more than n/2 numbers must contain a duplicate (since n/2 > 2^31, the number of distinct 32-bit values with that leading bit). Continue the test with the numbers in that file.
  3. Test the second bit of the integers in that file to split it again into two smaller files.
  4. Continue this process recursively for at most min(32, log2 n) passes. When the final partition covers a single value and still holds more than one number, that value appears at least twice.
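To make these steps concrete, here is a minimal Python sketch of how I read the original approach. It works on an in-memory list rather than intermediate files, and it compares each number against the midpoint of the current value range instead of literally extracting bits, which amounts to the same thing; the name `find_duplicate` is my own:

```python
def find_duplicate(nums, bits=32):
    """Binary search over the value range: repeatedly recurse into the
    half-range that holds more numbers than it has distinct values."""
    lo, hi = 0, 2 ** bits              # candidate value range is [lo, hi)
    candidates = list(nums)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = [x for x in candidates if x < mid]    # "current bit is 0"
        right = [x for x in candidates if x >= mid]  # "current bit is 1"
        # Pigeonhole: a half holding more numbers than it has distinct
        # values must contain a duplicate; recurse into that half.
        if len(left) > mid - lo:
            candidates, hi = left, mid
        else:
            candidates, lo = right, mid
    return lo   # the one remaining value, which appears at least twice
```

For example, `find_duplicate([3, 1, 3, 0, 2], bits=2)` returns 3; the guarantee relies on there being more numbers than possible values (here 5 > 4, in the book's problem 4,300,000,000 > 2^32). Note that `candidates` is not guaranteed to shrink, e.g. with n copies of the same number, which is exactly the worst case described below.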

As the author notes, in the worst case (e.g. n copies of the same number), all n numbers must be tested in every pass, leading to n × log n operations and a complexity of O(n log n):

n
n
n
...

What I don't understand is how Jim Saxe's approach reduces the work to linear time.

As the author describes Jim Saxe's approach: for each partition, store only one more number than the number of distinct values the partition can contain. For example, if all n numbers have the leading bit 0, all n of them would end up in the same partition under the previous approach. With Jim's approach only n/2 + 1 numbers are retained, so the numbers tested are roughly halved in each pass (see the sketch after this list):

n
n/2 + 1
n/4 + 1
...
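Under the same assumptions as before (Python, an in-memory list standing in for the work tape, and `find_duplicate_saxe` being my own name), here is how I read Jim Saxe's modification. The only change from the sketch above is that the candidate list is truncated to m + 1 integers, where m is the number of distinct values the current range can hold:

```python
def find_duplicate_saxe(nums, bits=32):
    """Same search as above, but the work tape never holds more than
    m + 1 integers, where m is the size of the current value range."""
    lo, hi = 0, 2 ** bits
    m = hi - lo
    candidates = []
    for x in nums:                     # initial work tape: at most m + 1 integers,
        if len(candidates) < m + 1:    # any further input is simply discarded
            candidates.append(x)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = [x for x in candidates if x < mid]
        right = [x for x in candidates if x >= mid]
        if len(left) > mid - lo:
            candidates, hi = left, mid
        else:
            candidates, lo = right, mid
        m = hi - lo
        del candidates[m + 1:]         # keep only (range size) + 1 integers
    return lo
```

The m + 1 bound is what produces the n/2 + 1, n/4 + 1, ... counts listed above.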

The total work would then be (n × log n) / 2 operations, and the complexity is still O(n log n), which is not linear.
Am I missing something?
