This is a solved problem where we have to find a missing integer in the input data, which is in the form of a file containing 4 billion unsorted unsigned integers. The catch is that only 10 MBytes of memory can be used.
The author gives a solution where we do 2 passes. In the first pass we create an array of "blocks" of some size, e.g. block 0 represents 0 to 999, block 1 represents 1000 to 1999, and so on. Thus we know how many values should be present in each block.
Now we scan through the file and count how many values are actually present in each block; a block whose actual count falls short of what it should hold must be missing a value, which leads us to the missing integer.
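
To make sure I have the idea right, here is a minimal sketch of the two passes in Python. The file format (one integer per line), the helper names and the block size of 1000 are my own assumptions, purely for illustration; a real solution would instead pick the block size so the counts array fits in the 10 MByte budget.

```python
# Sketch of the two-pass idea. My assumptions: the input file holds one
# unsigned 32-bit integer per line, and the block size is 1000 as in the
# example above.

BLOCK_SIZE = 1000
RANGE = 2**32                                        # values are 32-bit unsigned
NUM_BLOCKS = (RANGE + BLOCK_SIZE - 1) // BLOCK_SIZE  # blocks cover 0 .. 2^32 - 1

def capacity(block):
    """Number of distinct integers that fall into this block."""
    low = block * BLOCK_SIZE
    return min(low + BLOCK_SIZE, RANGE) - low

def find_missing(path):
    # Pass 1: count how many values from the file land in each block.
    counts = [0] * NUM_BLOCKS
    with open(path) as f:
        for line in f:
            counts[int(line) // BLOCK_SIZE] += 1

    # A block whose count is below its capacity is missing at least one value.
    target = next(b for b in range(NUM_BLOCKS) if counts[b] < capacity(b))
    low = target * BLOCK_SIZE

    # Pass 2: mark which values of that one block actually occur in the file.
    seen = [False] * capacity(target)
    with open(path) as f:
        for line in f:
            v = int(line)
            if low <= v < low + capacity(target):
                seen[v - low] = True

    # Any unmarked slot corresponds to a missing integer.
    return low + seen.index(False)
```
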
I understand the approach. However, the calculation of the appropriate block size starts with this:
The array in the first pass can fit in 10 MBytes, or roughly 2^23 bytes, of memory.
How is 10 MB the same as 2^23 bytes? I am unable to understand where the 2^23 number came from.
So 2^24 is 16 MBytes, so presumably he is taking 2^23, which is the closest power of 2 that is 10 MBytes or less. If that is the case, why can he not use the whole 10 MBytes? i.e. why does he have to use a power of 2 to get the size here?
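
For reference, this is the arithmetic I am doing (assuming 1 MByte = 2^20 bytes):

```python
# My own back-of-the-envelope check of the sizes being compared.
print(10 * 2**20)  # 10 MBytes = 10,485,760 bytes
print(2**23)       #  8 MBytes =  8,388,608 bytes (largest power of 2 below 10 MBytes)
print(2**24)       # 16 MBytes = 16,777,216 bytes (already more than 10 MBytes)
```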