7

The problem is, given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file, assume only have 10 MB of memory.

Searched for some solutions, one of which is to store integers into bit-vector blocks (each block representing a specific range of integers among 4 billion range, each bit in the block represent for an integer), and using another counter for each block, to count the number of integers in each block. So that if number of integers is less than the block capacity for integers, scan the bit-vector of the block to find which are missing integers.

My confusion for this solution is, it is mentioned the optimal smallest footprint is, when the array of block counters occupies the same memory as the bit vector. I am confused why in such situation it is the optimal smallest footprint?

Here are calculation details I referred,

Let N = 2^32.
counters (bytes): blocks * 4
bit vector (bytes): (N / blocks) / 8
blocks * 4 = (N / blocks) / 8
blocks^2 = N / 32
blocks = sqrt(N/2)/4

thanks in advance, Lin

Lin Ma
  • 9,739
  • 32
  • 105
  • 175
  • 1
    [Find an integer not in list of 4 billion using 4 mb. But 4 MB is not enough](http://stackoverflow.com/q/22699104/995714), [Find a missing 32bit integer among a unsorted array containing at most 4 billion ints](http://stackoverflow.com/q/1642474/995714), [Finding a missing integer, given 4 billion unsorted integers](http://stackoverflow.com/q/22774896/995714) – phuclv Oct 10 '15 at 04:40
  • 1
    [Find missing number from 4 billion (again)](http://stackoverflow.com/q/10171570/995714) – phuclv Oct 10 '15 at 05:15
  • 1
    Of course since there is plenty of HD space, you copy the original, sort the file on disk with a bunch of fread/fseek/fwrite (O(n log n)), and voilà... no much memory (RAM) taken... – Déjà vu Oct 10 '15 at 07:11
  • @LưuVĩnhPhúc, read the post, but it does not mention how to find an optimized block size. If you have any thoughts, appreciate for sharing. :) – Lin Ma Oct 12 '15 at 06:26
  • @LưuVĩnhPhúc, for this thread, "Find missing number from 4 billion (again) ", it uses bit vector, and 10M is not enough for 4B numbers for bit vector. :) – Lin Ma Oct 12 '15 at 06:28
  • @LưuVĩnhPhúc, "Find an integer not in list of 4 billion using 4 mb. But 4 MB is not enough", this thread does not mention optimized block size either. :) – Lin Ma Oct 12 '15 at 06:29
  • @LưuVĩnhPhúc, "Find a missing 32bit integer among a unsorted array containing at most 4 billion ints", it does not discuss optimized block size either. :) – Lin Ma Oct 12 '15 at 06:31
  • @LưuVĩnhPhúc, "Finding a missing integer, given 4 billion unsorted integers", it seems does not discuss optimized block size either. :) – Lin Ma Oct 12 '15 at 06:32
  • @ring0, thanks for the comments and do you understand the formula of optimized block size in my post? Thanks. – Lin Ma Oct 12 '15 at 07:35
  • 1
    @LinMa What does your block counting solution have to do with a 'sort on file'? – Déjà vu Oct 12 '15 at 07:49
  • @ring0, I think count does not need any sorting? Please feel free to correct me if I am wrong. – Lin Ma Oct 12 '15 at 20:35

2 Answers2

5

Why it is the smallest memory footprint:

In the solution you proposed, there are two phases:

  1. Count number of integers in each block

    This uses 4*(#blocks) bytes of memory.

  2. Use a bit-vector each bit representing an integer in the block.

    This uses (blocksize/8) bytes of memory, which is (N/blocks)/8.

Setting the 2 to be equal results in blocks = sqrt(N/32) as you have mentioned.

This is the optimal because the memory required is the maximum of the memory required in each phase (which must both be executed). After the 1st phase, you can forget the counters, except for which block to search in for phase 2.

Optimization

If your counter saturates when it reaches capacity, you don't really need 4 bytes per counter, but rather 3 bytes. A counter reaches capacity when it exceeds the number of integers in the block.

In this case, phase 1 uses 3*blocks of memory, and phase 2 uses (N/blocks)/8. Therefore, the optimal is blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approx 12910, and the block size is 309838 integers per block. This fits in 3 bytes.

Caveats, and alternative with good average case performance

The algorithm you proposed only works if all input integers are distinct. In case the integers are not distinct, I suggest you simply go with a randomized candidate set of integers approach. In a randomized candidate set of integers approach, you can select say 1000 candidate integers at random, and check if any are not found in the input file. If you fail, you can try find another random set of candidate integers. While this has poor worst case performance, it would be faster in the average case for most input. For example, if the input integers have a coverage of 99% of possible integers, then on average, with 1000 candidate integers, 10 of them will not be found. You can select the candidate integers pseudo-randomly so that you never repeat a candidate integer, and also to guarantee that in a fixed number of tries, you will have tested all possible integers.

If each time, you check sqrt(N) candidate integers, then the worst case performance can be as good as N*sqrt(N), because you might have to scan all N integers sqrt(N) times.

You can avoid the worst case time if you use this alternative, and if it doesn't work for the first set of candidate integers, you switch to your proposed solution. This might give better average case performance (this is a common strategy in sorting where quicksort is used first, before switching to heapsort for example if it appears that the worst case input is present).

ronalchn
  • 12,225
  • 10
  • 51
  • 61
  • Thanks ronalchn, in my original solution, when using blocks = sqrt(N/32), it means size of one block of numbers + size of whole counters could fit into memory? I am a bit lost what is the logical meaning of "Setting the 2 to be equal results". – Lin Ma Oct 10 '15 at 06:33
  • Hi ronalchn, if you could comment on my question, it will be great. :) – Lin Ma Oct 12 '15 at 06:32
  • 1
    That's not true, either one block of numbers OR the whole counters fit in memory at any one time. However, you only need to store either of them at once. After the first phase, you know which block to look in, so you can FORGET about the rest of the counters, and re-use the memory for your block of numbers. – ronalchn Oct 12 '15 at 06:34
  • Thanks ronalchn, after first phase, why one block of integer holds the whole memory is optimal? And optimal for what? I can load two or more smaller blocks (with a different block size) in 2nd phase as well? – Lin Ma Oct 12 '15 at 07:37
  • 1
    You will only have enough space to keep track of 1 block in phase 2. Phase 1 will tell you in which blocks you can find an answer. Since you only need 1 unique integer, you only need to check one of the blocks that is not filled. You won't have space to store multiple blocks, and you won't need to. – ronalchn Oct 12 '15 at 07:47
  • Thanks ronalchn for the patience to explain. My confusion is, for 10M memory, I can have 1 block of 10M footprint, and 2 blocks of 5M foot print, to be the performance is the same. Not sure why there must be 1 block fill with 10M? Please feel free to correct me if I am wrong. – Lin Ma Oct 27 '15 at 23:48
1
# assumes all integers are positive and fit into an int
# very slow, but definitely uses less than 10MB RAM
int generate_unique_integer(file input-file)
{
  int largest=0
  while (not eof(input-file))
  {
    i=read(integer)
    if (i>largest) largest=i
  }
  return i++; #larger than the largest integer in the input file
}
  • 2
    And when largest is INT_MAX? – Sami Kuhmonen Oct 10 '15 at 04:12
  • Thanks user5429970, but how your answer is related to my question? My question is not related to get max. :) – Lin Ma Oct 10 '15 at 06:29
  • 1
    @LinMa By finding the maximum integer `m` in the file, and returning the next larger integer `m+1` this does provide an answer to the question, as `m+1` is definitely *not* one of the integers in the file. Moreover, this approach requires no additional storage besides the few scalars used in the code. As others have pointed out, this approach fails if `m==INT_MAX`, as `m+1` is not representable in this case, and I do not see an easy way to fix that issue. – njuffa Oct 10 '15 at 13:16
  • Thanks all the same @njuffa. If you have any thoughts on my original question, appreciate for sharing. :) – Lin Ma Oct 11 '15 at 06:56
  • Lin Ma: I don't understand why you feel this response is not associated with your question. This is definitely a method to identify a unique integer. – user5429970 Nov 15 '15 at 17:00
  • regarding the INT_MAX issue. This is a good point, but can be easily resolved (in a slow but memory-conservative approach) with an outer loop: – user5429970 Nov 15 '15 at 17:08