
It's an interview question:

There are 1 billion cell-phone numbers, each with 11 digits, stored in random order in a file; for example 12345678910 (the first digit must be 1). Go through these numbers and determine whether any duplicate exists: if a duplicate is found, return True, otherwise return False. Only 10 MB of memory is allowed.

Here is my solution:

Hash all the numbers into 1000 files using hash(num) % 1000; duplicates will then fall into the same file.

After the hashing, I get 1000 small files, each of which contains 1 million numbers at most, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.

Then, for each file, build a hash table that stores each number together with a flag recording whether it has occurred.

I guess it will take 5 B to represent a number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually, 1 bit would suffice for the flag, because I only need to find out whether a duplicate exists, not how many times a number occurs. But how can I attach a 1-bit flag to each number? I'm stumped, so I chose bool as the flag, which takes 1 B. So, finally, each number in the hash table takes 5 B (number) + 1 B (flag) + 4 B (next-pointer) = 10 B, and each file takes 10 MB for its hash table.
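Just to illustrate what I mean by the 5 B representation, here is a rough C sketch (the struct and function names are made up). The upper 3 digits are always 100-199 because of the leading 1, so they fit in one byte, and the lower 8 digits are below 10^8 < 2^32:

#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t hi; uint32_t lo; } packed_phone;

packed_phone pack(uint64_t num) {
    packed_phone p;
    p.hi = (uint8_t)(num / 100000000ULL);   /* upper 3 digits: 100..199 */
    p.lo = (uint32_t)(num % 100000000ULL);  /* lower 8 digits */
    return p;
}

uint64_t unpack(packed_phone p) {
    return (uint64_t)p.hi * 100000000ULL + p.lo;
}

int main(void) {
    uint64_t n = 12345678910ULL;
    packed_phone p = pack(n);
    /* caveat: the struct itself is padded to 8 bytes; to really spend
     * only 5 B per entry, keep hi and lo in two parallel arrays */
    printf("%llu -> %llu\n", (unsigned long long)n,
           (unsigned long long)unpack(p));
    return 0;
}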

That's my stupid solution; please give me a better one.

Thanks.

FOLLOW UP:

If there are no duplicates among these 1 billion phone numbers, then, given one phone number, how can I find out whether the given one is or is not among the 1 billion numbers? Use as little memory as possible.

I came up with 2 solutions:

  1. The phone number can be represented using 5 B as I said above. Scan through the file, read one number at a time, and XOR the given number with the one read from the file; if the result is 0, then the given number is in the file. It'll take O(n) time, right?

  2. Partition the numbers into 2 smaller files according to the leading bit, which means: numbers with a leading 1-bit go to one file, and numbers with a leading 0-bit go to another; meanwhile, count how many numbers end up in each file. If the given number falls into the 1-bit file and that file's count is not full, then partition the 1-bit file again according to the second bit and check the given number recursively; if the 1-bit file is full, the given number has to be in the file. It'll take O(log n) time, right? (See the sketch after this list.)
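To make #2 concrete, here is a rough C sketch of the counting idea, assuming the numbers fit in an in-memory array (in practice each level would stream through a partition file on disk); `contains` is a made-up name. With no duplicates, a bit-prefix partition whose size equals the whole value range of that prefix must contain every number with that prefix:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static bool contains(uint64_t *nums, size_t n, uint64_t target, int bit) {
    if (bit < 0)
        return n == 1;                    /* range narrowed to one value */
    if (n == (uint64_t)1 << (bit + 1))
        return true;                      /* partition full: target must be there */
    uint64_t want = (target >> bit) & 1;
    size_t k = 0;                         /* in-place partition by current bit */
    for (size_t i = 0; i < n; i++)
        if (((nums[i] >> bit) & 1) == want) {
            uint64_t tmp = nums[i]; nums[i] = nums[k]; nums[k++] = tmp;
        }
    return contains(nums, k, target, bit - 1);
}

int main(void) {
    /* toy values that fit in 4 bits; real phone numbers (after dropping
     * the leading 1) span [0, 10^10), i.e. 34 bits */
    uint64_t nums[] = {5, 9, 12, 3, 7};
    printf("%d\n", contains(nums, 5, 12, 3));  /* 1: present */
    printf("%d\n", contains(nums, 5, 6, 3));   /* 0: absent */
    return 0;
}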

Alcott
  • What is a billion? 1E9 ? – wildplasser Oct 09 '11 at 11:24
  • @wildplasser, yes, 10^9. – Alcott Oct 09 '11 at 11:31
  • It is evident from the rest of your post, but Americans seem to have a strange habit of confusing xxxillions ;-) – wildplasser Oct 09 '11 at 11:33
  • 1
    This was one of featured questions (weekly newsletter) a while back. Foolishly I wanted to think it through myself before reading the answers... :) – Gleno Oct 09 '11 at 11:39
  • Since you somehow "have" the input, aren't you already using more than 10MB? If that part doesn't count, a simple O(n^2) algorithm would use only O(1) extra – harold Oct 09 '11 at 11:43
  • @Alcott for #1, think about [1, 2, 3]: it returns 0, the same as [1, 1]; for #2, it takes O(n) for each number (n + n/2 + n/4 ...), so it takes O(n^2) in total – lostyzd Oct 09 '11 at 13:27
  • @lostyzd, for #2, takes O(n^2)? I don't get it, can you please elaborate on that? – Alcott Oct 09 '11 at 13:31
  • @lostyzd, one more thing: for #1, because there are no duplicates, `xor` should work right. – Alcott Oct 09 '11 at 13:34
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/4119/discussion-between-lostyzd-and-alcott) – lostyzd Oct 09 '11 at 13:44
  • @lostyzd, well, sorry, I just updated my question, see my `FOLLOW UP` plz. – Alcott Oct 09 '11 at 23:22

6 Answers


Fastest solution (also in terms of programmer overhead :)

# Generate some 'phones'
yes 1 | perl -wne 'chomp; ++$a; print $_."$a\n";' > phones.txt

# Split phones.txt in 10MB chunks
split -C 10000000 phones.txt

# Sort each 10MB chunk with 10MB of memory
for i in x??; do sort -S 10M $i > $i.srt; echo -ne "$i.srt\0" >> merge.txt; done

# Merge the sorted chunks with 10MB of memory
sort -S 10M --files0-from=merge.txt -m > sorted.txt

# See if there are any duplicates
test -z "$(uniq -d sorted.txt)"

Check that the memory usage constraint is met, with `pmap $(pidof sort)` for example.

piotr
  • 4
    upvoted.. but note that split looks suspicious, it will split numbers, that's not so cool :) also, the merge part with 100 files is quite an overhead, this is definitely not the fastest solution (in terms of running time). – Karoly Horvath Oct 09 '11 at 12:43
  • 2
    @yi_H True, upvote to you also, the split is bogus, Should be by line numbers. I changed -b to -C in split. – piotr Oct 09 '11 at 22:18
  • @piotr What is that language? I cannot get anything from those symbols –  Jul 24 '12 at 17:18

> After the hashing, I got 1000 small files, each of which contains 1 million numbers at most, right?

Not true; in the extreme case it's possible that one file contains all the numbers.

Create the files based on the first or last x digits of the numbers (ignoring the leading 1). When creating those files, you can actually chop off those digits, because within a file they are all equal. This is a lot better than hashing: although all the numbers can still end up in one file, now the range of those numbers is limited, so you can fit a bitmap for them into 10 MB.

Each number can be represented by a single bit, because the only information you need is whether the number occurred previously. You don't have to store the actual numbers: the address of the bit is the number. In 10 MB you can store 80M bits, so you would need 1G / 80M = 12.5 files; but remember, those leading digits must differ, so you will actually need 100 files (x = 2).
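A rough C sketch of the two passes (the bucket file names are made up, and it uses x = 3 instead of the x = 2 above, so each bucket's bitmap is only 10^7 bits = 1.25 MB; it also assumes the OS lets you keep ~1000 files open at once, and error handling is omitted):

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define BUCKETS 1000          /* keyed by the 3 digits after the leading 1 */
#define RANGE   10000000ULL   /* 10^7 possible values left per bucket */

/* Pass 1: route each number to a bucket file, chopping the routed digits. */
static void partition(const char *path) {
    FILE *out[BUCKETS];
    char name[32];
    for (int i = 0; i < BUCKETS; i++) {
        snprintf(name, sizeof name, "bucket%03d.txt", i);
        out[i] = fopen(name, "w");
    }
    FILE *in = fopen(path, "r");
    unsigned long long num;
    while (fscanf(in, "%llu", &num) == 1) {
        unsigned long long v = num % 10000000000ULL;  /* drop the leading 1 */
        fprintf(out[v / RANGE], "%llu\n", v % RANGE); /* keep last 7 digits */
    }
    fclose(in);
    for (int i = 0; i < BUCKETS; i++) fclose(out[i]);
}

/* Pass 2: a bucket contains a duplicate iff some bit would be set twice. */
static bool bucket_has_duplicate(const char *name) {
    static unsigned char bits[RANGE / 8];  /* 1.25 MB bitmap */
    memset(bits, 0, sizeof bits);
    FILE *f = fopen(name, "r");
    unsigned long long v;
    bool dup = false;
    while (!dup && fscanf(f, "%llu", &v) == 1) {
        if (bits[v / 8] & (1u << (v % 8))) dup = true;
        else bits[v / 8] |= 1u << (v % 8);
    }
    fclose(f);
    return dup;
}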

Finally, you don't have to create those files: you can also scan the whole file multiple times. In this case you can even keep multiple bitmaps in memory, as a single one doesn't occupy the whole 10 MB.

I strongly suggest reading this book; it starts with an almost identical example: http://www.amazon.co.uk/Programming-Pearls-ACM-Press-Bentley/dp/0201657880

Karoly Horvath
  • one thing I don't quite understand: even if I now have the `100` files, the numbers in each file still range over `0000 0000 - 9999 9999`, right? I still cannot represent them with `80M bits`. – Alcott Oct 09 '11 at 12:29
  • 1
    The first digit is always 1, according to the question, so you can discard that too, leaving you with 7 digits, meaning you'd only need a bitmap of a bit over 1 megabyte to represent each file. – Nick Johnson Oct 10 '11 at 00:13

No need for a hash. 10 MB = 83,886,080 bits, so put each number into one of the ranges [0, 83886080), [83886080, 83886080 * 2), ..., [..., 9999999999) (don't consider the first digit, which is always 1): that's about 9999999999 / 83886080 ≈ 120 files. Then build a bit set for each range; it takes O(n) in total.
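A hedged C sketch of this multi-pass variant, assuming one number per line in a file whose name (phones.txt) is made up: each pass covers one 83,886,080-wide slice of the value range with a 10 MB bitmap and re-reads the whole file.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define NBITS (10ULL * 1024 * 1024 * 8)   /* 83,886,080 bits = 10 MB */

static bool slice_has_duplicate(const char *path, unsigned long long lo) {
    unsigned char *bits = calloc(NBITS / 8, 1);   /* the 10 MB bitmap */
    FILE *f = fopen(path, "r");
    unsigned long long num;
    bool dup = false;
    while (!dup && fscanf(f, "%llu", &num) == 1) {
        unsigned long long v = num % 10000000000ULL;  /* drop the leading 1 */
        if (v < lo || v >= lo + NBITS) continue;      /* handled in another pass */
        unsigned long long i = v - lo;
        if (bits[i / 8] & (1u << (i % 8))) dup = true;  /* bit already set */
        else bits[i / 8] |= 1u << (i % 8);
    }
    fclose(f);
    free(bits);
    return dup;
}

int main(void) {
    for (unsigned long long lo = 0; lo < 10000000000ULL; lo += NBITS)
        if (slice_has_duplicate("phones.txt", lo)) { puts("True"); return 0; }
    puts("False");   /* ~120 passes over the file in the worst case */
    return 0;
}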

lostyzd
  • 3
    @Dialecticus That's just your understanding, which is not mentioned in that question. – lostyzd Oct 09 '11 at 11:27
  • @Dialecticus Since the OP's proposed solution also relies on extra disk storage, I think it's safe to say it's allowed, and the constraint is only on RAM. – Nick Johnson Oct 10 '11 at 00:11
  • I'm removing downvote, but I still believe that constraints are this: you have one big file on disk, and you can't use disk for anything else except reading that file. You also have 10MB of RAM at your disposal. If this collides with OP's view then I believe OP got it wrong. – Dialecticus Oct 10 '11 at 08:45
  • Since this is a hypothetical question it makes no sense to allow disk swap, since it's just a slower form of RAM. It's obvious that Dialecticus is right. – Gleno Nov 22 '11 at 09:39

You can follow the bitset technique. Refer to this question and its answers: Find an integer not among four billion given ones

vine'th

The interview question imposes only a limit on the memory used, not on the time it takes to provide an answer.

It is thus reasonable to implement it like this:

take the first number
compare it to all numbers following it
take the second number
compare it to all numbers following it
...

This takes an enormous amount of time to process the billion numbers (O(n^2)), but does not take more than 10 MB of memory space.
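For instance, a rough C sketch of this two-cursor scan, assuming one number per line in a file whose name is made up; only two numbers are in memory at any moment (a real ~12 GB file would need fseeko/ftello for the offsets):

#include <stdio.h>
#include <stdbool.h>

static bool has_duplicate(const char *path) {
    FILE *a = fopen(path, "r"), *b = fopen(path, "r");
    unsigned long long x, y;
    long pos = 0;
    /* outer cursor walks the file; inner cursor rescans everything after it */
    while (fseek(a, pos, SEEK_SET) == 0 && fscanf(a, "%llu", &x) == 1) {
        pos = ftell(a);
        fseek(b, pos, SEEK_SET);
        while (fscanf(b, "%llu", &y) == 1)
            if (x == y) { fclose(a); fclose(b); return true; }
    }
    fclose(a); fclose(b);
    return false;
}

int main(void) {
    puts(has_duplicate("phones.txt") ? "True" : "False");
    return 0;
}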

Adrien Plisson
  • It is understood that the candidate should adopt the best-performing solution. For an input as large as 1 billion, O(n^2) is the worst solution the interviewer could be looking for. Don't forget you would be doing lots of IO to find a match in your solution. – Manish Shukla Feb 17 '14 at 17:50

You can use a Bloom filter, which consists of an m-bit array and uses k hash functions. Though I am not sure about how many hash functions you may need.
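For reference, the usual sizing rule is k = (m/n) ln 2 hash functions for an m-bit array holding n elements; with m = 80M bits and n = 1 billion numbers that's less than 1 bit per element, so the filter would be badly undersized here, and since Bloom filters give false positives, a reported duplicate would have to be re-checked. A minimal C sketch with k = 2 made-up mixing functions:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define M (10u * 1024 * 1024 * 8)   /* 80M bits = 10 MB */
static uint8_t bits[M / 8];

/* two illustrative 64-bit mixers, not a recommendation */
static uint64_t h1(uint64_t x) { x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; return x ^ (x >> 29); }
static uint64_t h2(uint64_t x) { x *= 0x9e3779b97f4a7c15ULL; return x ^ (x >> 31); }

/* returns true if x was possibly seen before (false positives possible) */
static bool check_and_add(uint64_t x) {
    uint64_t i1 = h1(x) % M, i2 = h2(x) % M;
    bool seen = (bits[i1 / 8] >> (i1 % 8) & 1) && (bits[i2 / 8] >> (i2 % 8) & 1);
    bits[i1 / 8] |= 1u << (i1 % 8);
    bits[i2 / 8] |= 1u << (i2 % 8);
    return seen;
}

int main(void) {
    printf("%d\n", check_and_add(12345678910ULL));  /* 0: not seen yet */
    printf("%d\n", check_and_add(12345678910ULL));  /* 1: seen (or false positive) */
    return 0;
}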

NinjaCoder