Find missing number in unsorted input with memory constraints

Question

I was reading about the problem of finding the missing number from a series of 4 billion 32 bit integers in Programming Pearls, but could not quite get the solution.

Given a sequential file that contains at most 4 Billion 32-bit integers in random order, find a 32-bit integer not on the file.
Constraint: Few hundrends of bytes of main memory but ample use of extrernal scratch files on disk

The solution described is a process where we separate the numbers in ranges using the the upper bits (I.e. in the first pass we write those with leading 0 bit to one file and those with leading 1 bit to another.We keep going using the second bit etc.) and divide in half using as new search range the half containing less than half of the numbers in the range.

I googled and found a similar post in SO which does not quite answer my question which is how the exact number is found (I undestand how the binary search fits in separate the ranges).
The answer of @Damien_The_Unbeliever seems the most relevant but from my point of view I thought that following the process we would end up with 2 files: A file with 2 numbers in range and a file with 1 number.
By subtracting the (one) number in one file with the largest of the others we can get a missing number (no need of bitmask and I am not quite sure if we could actually apply a bitmask since the range is unknown at any point).

Am I wrong on this? Could someone help figure this out?

possible duplicate of [Find a missing 32bit integer among a unsorted array containing at most 4 billion ints](http://stackoverflow.com/questions/1642474/find-a-missing-32bit-integer-among-a-unsorted-array-containing-at-most-4-billion) — Mitch Wheat, Apr 14 '12 at 09:39
@MitchWheat:I have already linked to that post and it does not answer my question.That post is about understanding the process to separate range.My question is how to find the actual number.This question is never addressed or answered in that post except a hint by `Damien_The_Unbeliever` that I already mention — Cratylus, Apr 14 '12 at 09:43
You do not need any external storage. Instead of separating numbers by bit *k*, just count them separately. — n. m. could be an AI, Apr 14 '12 at 10:26
@n.m. : I'm not quite sure what you mean. Could you elaborate in a full answer? — Li-aung Yip, Apr 14 '12 at 10:29
@Li-aung Yip: please ignore my comment, this method is not efficient. — n. m. could be an AI, Apr 14 '12 at 13:02
When you separate your number in two files and decide which file you will examine further, you fix one bit of the number you are searching. Suppose on step 1 there are 2 billion numbers with bit 0 equals 0, and 1999999999 numbers with bit 0 equals 1. You continue with the second file and make a note that bit 0 of your number equals 1. Suppose there are 999999999 numbers with bit 1 equals 0 and 1000000000 numbers with bit 1 equals 1. You continue with the first file and make a note that bit 1 your number equals 0. In the end you know all the bits. — n. m. could be an AI, Apr 14 '12 at 13:08
@n.m.:I am not sure how this process ends.I mean by setting corresponding bits and following the smaller file, you will get eventually to a file that can be divided in half exactly e.g. has 2 numbers.How you proceed once you reach this point? — Cratylus, Apr 15 '12 at 18:30
Having a file with one number, and even an empty file, is not a problem (by the way you do not necessarily get there from a file with 2 numbers; you can have a file with 1025 numbers that happen to split 1024:1). Just run the algorithm step exactly as described. Split the file in two, one part has numbers with i_th bit equals 0 and the other part has numbers with i_th bit equals 1. Either part, or both, can contain just one number, or be empty, or whatever. It does not matter at all. — n. m. could be an AI, Apr 15 '12 at 21:35

maybeWeCouldStealAVan · Answer 1 · 2012-04-15T03:26:47.543

You needn't copy the data itself or write anything to disk; just count the members of some partition of the data to identify openings. The tradeoff is between number of passes and memory (more memory allows for more counts, smaller partitions).

Here's a solution in 8 passes. We'll partition the data using 4 bits at a time. 2^4 = 16 possible values, so we'll need 64 bytes to store counts for each of the 16 partitions. So each 4 bit nibble value has an associated count.

Make a pass through the data, incrementing the associated count matching the nibble in the first four bits of the number. If a partition is full, its count will be 2^28. Pick one of the nibbles that isn't full --- this will be the first nibble of your result.

Zero your counts and make another pass, ignoring numbers that don't match the first nibble and incrementing the count associated with the second nibble in the number. A full second nibble will have a value of 2^24. Pick one that isn't full.

Proceed in this manner until you have all 8 nibbles, and there's your answer in O(N).

If you only check one bit at a time, ~~this would be a binary search~~ requiring 32 passes. (EDIT: Not really a binary search, since you still have to read the values that you're skipping. That's why it's O(N). See edit below.) If you have a KB of memory for counts, you can do it in 4 passes; with 256 KB you can do it in 2 --- actually 128 KB since you could then use short ints for your counts. Here we're constrained to a few hundred bytes --- maybe 6 bits/6 passes/256 bytes?

EDIT: Li-aung Yip's solution scales better, and clearly it can be modified to partition by more than one bit in a pass. If the writing is slower than reading, then maybe the best solution would be a hybrid between this read-only answer and Li-aung Yip's disk based one.

Do a pass counting nibbles as above, then as you count the second set of nibbles, write only the numbers (or possibly just the last 28 bits of them) that match the first nibble, into 16 files according to the second nibble.

Pick your second nibble and read it to get counts for the third nibble, writing only the numbers matching the second nibble, etc. If the file is close to capacity, if the numbers are fairly uniformly distributed, or if you pick the least-full nibbles each time, you'll have to write no more than about 6.66% (1/16+1/16^2...) of the file size.

I can not follow your solution.Especially how you end up needing `64` bytes for counts etc.Could you elaborate on this? — Cratylus, Apr 15 '12 at 18:27
Sure. For convenience, lets talk about our integers in hexadecimal, so each character is 4 bits. On our first pass, we count how many integers start with 0, with 1, with 2, ... e, f. That's sixteen different starting characters, so we'll need 16 integers, or 64 bytes, to count them all in one pass. In the 28 bits that remain, there could be up to 2^28 different combinations for each of the starting characters; if one has less than that, we know that there are numbers beginning with that character which are not in our data, so we can restrict our search to that subset, and so on. — maybeWeCouldStealAVan, Apr 17 '12 at 05:10

score 2 · Answer 2 · answered Apr 14 '12 at 10:07

2

After repeated binary partitioning of your numbers into smaller and smaller files, you'll end up with:

a bunch of files that contain two numbers, which only differ in their last significant bit
one file with only one number in it.

Get the missing number by flipping the last bit of the number that's in a file by itself.

Take the example of the numbers from 0x00 to 0x07, missing 0x04:

000
001

010
011

... (missing)
101

110
111

Take 101, flip the least significant bit, and get 100, which is the missing 0x04.

answered Apr 14 '12 at 10:07

Li-aung Yip

12,320
5
34
49

Note: I assume there's only one missing integer. – Li-aung Yip Apr 14 '12 at 10:13
+1.This is a better description of how I also suspected the solution.It seems that a bitmap is useless here – Cratylus Apr 14 '12 at 10:15
I don't see how this would change the original problem.There is at least 1 missing and this way we would still get one.Or I am missing your point? – Cratylus Apr 14 '12 at 10:16
It's a bit more complicated if there are two or more numbers missing. What happens if `010` and `011` are both missing in the example above? (It's a trivial problem, it just means you can't do it exactly the way I've described.) – Li-aung Yip Apr 14 '12 at 10:18
Well the problem specifies that there are more numbers than 1 missing and if exactly 1 number was missing from the range, I wouldn't bother with that solution.I would just add all the numbers and subtract from `N(N+1)/2` to get the number directly – Cratylus Apr 14 '12 at 10:22
If your binary partitioning routine also writes *empty* files, you can just compute what should have been in the file. – Li-aung Yip Apr 14 '12 at 10:24
If there's exactly one number missing and the number-sizes are the largest integral type, summing them will overflow at least once (it'll probably overflow on the order of N/2 times). – Vatine Apr 14 '12 at 10:26
@Vatine:You are right.But my point was that I don't think the author's intention was to provide a solution for a missin number.And my problem is that how he intents to detect one specific number from those missing from his process – Cratylus Apr 14 '12 at 10:28
@Vatine: It is not exactly hard roll your own integer with twice as much precision. But in this case it is not even needed. Just ignore the overflow and the answer will be correct. – n. m. could be an AI Apr 14 '12 at 12:32

score 1 · Answer 3 · answered May 31 '12 at 00:02

4 Billion integers are representable using a 32 bit integer. XOR ing a number with itself is a standard trick to zero out a register in assembly code. If you are guaranteed that only one number is missing, bitwise XOR on integers comes to the rescue, an O(N) solution, requiring only one additional 32 bit integer of space. Consider a simple example, a 3 bit number, thus numbers 0-7 representable and one of them is missing. Assume 6 (110) is missing missing = n1 XOR n2 XOR n3 XOR .. XOR n7. = 000 XOR 001 XOR 010 XOR 011 XOR 100 XOR 101 XOR 111

Had the problem been find the missing number between 1 and 100, you would need to start of my xoring out the elements that must be excluded. AND could be used drop from a 32 bit integer range to smaller ranges by masking out bits in the number.

Find missing number in unsorted input with memory constraints

3 Answers3