It's an interview question:
There are 1 billion cell-phone numbers which has 11 digits, they are stored randomly in a file, for example 12345678910, the first digit gotta be 1. Go through these numbers to see whether there is one with duplicate, just see if duplicate exists, if duplicate found, return True, or return False. Only 10 MB memory allowed.
Here is my solution:
Hash all these numbers into 1000 files using hash(num)%1000
, then the duplicates should fall into the same file.
After the hashing, I got 1000 small files, each of which contains 1 million
numbers at most
, right? I'm not sure about this, I simply do it 1 billion / 1000 = 1 million
.
Then for each file, build a hash table to store each number and a flag
representing its occurrence.
I guess, it will take 5 B
to represent the number, 4 B
for the lower 8 digits
and 1 B
for the upper 3 digits
; and actually 1 bit
will suffice the flag
, because I just need to find out whether duplicate exists, only how many times. But how can I apply the 1 bit
flag to each number? I'm stumbled, so I choose bool
to be the flag, 1 B
is taken.
So finally, each number in the hash table will take 5B<for number> + 1B<for flag> + 4B<for the next-pointer> = 10B
, then each file will take 10M
for the hash table.
That's my stupid solution, Please give me a better one.
Thanks.
FOLLOW UP:
If there are
no duplicates
in these 1 billion phone numbers, given one phone number, how to find out the given oneis or is not in
these 1 billion numbers? Use as few memory as possible.
I came up with 2 solutions,
The phone number can be represented using 5B as I said above, scan through the file, read one number a time, and
xor the given number with the one read from the file
, if the result is0
, then the given one is in the file, it'll takeO(n)
time, right?Partition
these numbers into2 small files
according to theleading bit
, which means, those numbers with aleading 1-bit
go to a file,leading 0-bit
go to another file, meanwhile count how many numbers in each file, if the given number fall into the 1-bit file and the 1-bit file'scount
isnot full
, thenagain partition
the 1-bit file according to thesecondary leading-bit
, and check the given number recursively; if the 1-bit fileis full
, then the given number gotta be in the file, it'll takeO(logn)
time, right?