Why this is the smallest memory footprint:
In the solution you proposed, there are two phases:
1. Count the number of integers in each block. This uses 4*(#blocks) bytes of memory.
2. Use a bit-vector, with each bit representing an integer in the block. This uses blocksize/8 bytes of memory, which is (N/blocks)/8.

Setting the two to be equal gives blocks = sqrt(N/32), as you have mentioned.
This is optimal because the memory required is the maximum of the memory required in the two phases (both of which must be executed): after the first phase you can discard the counters, keeping only the index of the block to search in phase 2.
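For concreteness, here is a minimal Python sketch of the two phases. It assumes the input can be scanned twice and is exposed as a callable `read_ints` that yields unsigned 32-bit integers; the function and parameter names are illustrative, not from your question.

```python
import array

def find_missing(read_ints, num_blocks=11586):   # ~sqrt(2**32 / 32)
    universe = 1 << 32
    block_size = (universe + num_blocks - 1) // num_blocks

    # Phase 1: one 4-byte counter per block, i.e. 4 * num_blocks bytes.
    counts = array.array('I', [0] * num_blocks)
    for x in read_ints():
        counts[x // block_size] += 1

    # A block whose count is below its capacity is missing at least one value.
    target = next((b for b in range(num_blocks)
                   if counts[b] < min(block_size, universe - b * block_size)),
                  None)
    if target is None:
        return None                       # every 32-bit value is present

    # Phase 2: one bit per integer of the chosen block, i.e. block_size/8 bytes.
    lo = target * block_size
    bits = bytearray((block_size + 7) // 8)
    for x in read_ints():
        if lo <= x < lo + block_size:
            off = x - lo
            bits[off >> 3] |= 1 << (off & 7)

    # The first unset bit corresponds to a missing integer.
    for off in range(min(block_size, universe - lo)):
        if not bits[off >> 3] & (1 << (off & 7)):
            return lo + off
```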
Optimization
If your counters saturate when they reach capacity instead of overflowing, you don't really need 4 bytes per counter, only 3: a counter never needs to count past the number of integers in its block, and with the block sizes below that count fits in 3 bytes.
In this case, phase 1 uses 3*blocks bytes of memory, and phase 2 uses (N/blocks)/8. Therefore, the optimum is blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approximately 12910, and the block size is about 309838 integers per block. That count fits in 3 bytes.
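One possible way to realize the 3-byte counters, in the same sketch style as above, assuming each count stays below 2^24 (which holds for the block sizes just mentioned); the helper names are illustrative.

```python
def make_counters(num_blocks):
    # 3 bytes per counter, 3 * num_blocks bytes in total.
    return bytearray(3 * num_blocks)

def increment(counters, block):
    i = 3 * block
    value = int.from_bytes(counters[i:i + 3], 'little')
    if value < 0xFFFFFF:                   # saturate at the 3-byte maximum
        counters[i:i + 3] = (value + 1).to_bytes(3, 'little')

def count(counters, block):
    i = 3 * block
    return int.from_bytes(counters[i:i + 3], 'little')
```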
Caveats, and an alternative with good average-case performance
The algorithm you proposed only works if all input integers are distinct.

If the integers are not distinct, I suggest you simply go with a randomized candidate-set approach: select, say, 1000 candidate integers at random and check whether any of them is absent from the input file. If all of them are found, try another random set of candidates. While this has poor worst-case performance, it will be faster in the average case for most inputs. For example, if the input integers cover 99% of the possible integers, then on average 10 out of 1000 candidates will not be found. You can select the candidates pseudo-randomly so that you never repeat a candidate, which also guarantees that after a fixed number of rounds you will have tested every possible integer.
If each round checks sqrt(N) candidate integers, the worst-case running time is bounded by N*sqrt(N), because you might have to scan all N integers sqrt(N) times before every possible value has been tried.
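A minimal sketch of this randomized variant, again assuming a re-scannable `read_ints` source; the parameter names and the simple `random.randrange` selection are illustrative (a pseudo-random permutation would give the no-repeat, full-coverage guarantee mentioned above).

```python
import random

def find_missing_randomized(read_ints, num_candidates=1000, max_rounds=10):
    tried = set()
    for _ in range(max_rounds):
        # Draw candidates that were not tried in earlier rounds.
        candidates = set()
        while len(candidates) < num_candidates:
            c = random.randrange(1 << 32)
            if c not in tried:
                candidates.add(c)
        tried |= candidates

        # One scan of the input eliminates every candidate that does occur.
        remaining = set(candidates)
        for x in read_ints():
            remaining.discard(x)
            if not remaining:
                break                      # all candidates present; try another set
        if remaining:
            return remaining.pop()         # any survivor is a missing integer
    return None                            # give up; fall back to the exact method
```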
You can avoid the worst-case time by combining the two: try this alternative first, and if it doesn't find a missing integer with the first set of candidates, switch to your proposed solution. This can give better average-case performance while keeping the worst case bounded (a common strategy in sorting, where quicksort is used first and the algorithm switches to heapsort if the input looks like a worst case, as introsort does).
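Putting the two together, a possible hybrid driver (reusing the hypothetical `find_missing_randomized` and `find_missing` sketches above) could look like this:

```python
def find_missing_hybrid(read_ints):
    # One cheap randomized round first; fall back to the exact
    # two-phase method if no missing candidate was found.
    result = find_missing_randomized(read_ints, max_rounds=1)
    return result if result is not None else find_missing(read_ints)
```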