4

So, I was watching one of these Google videos on how they conduct interviews (https://www.youtube.com/watch?v=XKu_SEDAykw) and I found their solution quite weird. Since there are a lot of clever people working at Google, I am now wondering whether I got something wrong or they did. Let me sum up the task and solution in case you don't want to watch it:

The task is to give an efficient algorithm for the following problem:

Given an array A of integers and a separate integer a, find two indices i,j, such that A[i]+A[j] = a.

They start off with the array being sorted and produce a nice linear-time algorithm. But then the interviewer asks what happens if the array isn't sorted, and they propose the following linear-time algorithm (they say that first sorting the array and then using their linear-time algorithm is too slow, although it would run in n log n time):

They go through the array from first to last and use a hash set to store the numbers they have already seen. For every index of the array they then only need to check the hash set (i.e. have I already seen the number I need to complete the sum?), and since that lookup is apparently possible in constant time, the whole algorithm runs in linear time (essentially a constant number of hash-set operations * Array.length).
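For reference, here is roughly how I understood their algorithm; this is my own sketch in Python, not code from the video (I use a dict instead of a bare set so the two indices can be returned):

```python
def find_pair(A, a):
    """Single pass; returns indices i, j with A[i] + A[j] == a, or None."""
    seen = {}                      # value -> index where it was first seen
    for j, x in enumerate(A):
        if a - x in seen:          # expected O(1) hash lookup
            return seen[a - x], j
        seen[x] = j                # remember this value for later elements
    return None
```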

Now to my criticism: I think there is a huge flaw in this solution, which essentially lies in the possibility of collisions. Since they assume n log n to be slow, we can assume that the hash set has fewer than log n many different entries. Now, given any large input, the probability of getting a collision tends to 1 when hashing n numbers into at most log n many sets. So they trade a very modest speed increase (they assume that ten billion is large for that array, but then the log (base 2) is still less than 30; matching this speed with the hash-set algorithm, however, would mean that over 300 million numbers get hashed to the same spot) for an almost certainly erroneous algorithm.

I either misunderstand something about hashing or this is an awful solution for the problem. Again, the safe n log n algorithm is not much slower than the one they give, unless the array gets so big that the hash algorithm is guaranteed to produce a collision.

I wouldn't be surprised if a constant-time algorithm that flips a coin for small arrays and always says yes for large arrays had, on average, the same success rate as their hash-set solution.

If I misunderstand something about hashing, please point it out, because I find it rather hard to believe that they would make such an error at a top-notch computer engineering company.

elixenide

2 Answers

6

To be clear, a "hash set" is a hash table where the key is the entire entry; there is no associated value, so the only interesting fact about a key is that it is present. This is a minor detail in the implementation of a hash table.

As has been noted, there is no basis to your statement that the size of the hash set needs to be less than log n in order to reduce lookup time. It's the other way around: the size of the hash set (the number of buckets) should be linear in the size of the dataset, so that the expected length of a hash chain is O(1). (For complexity analysis, it doesn't matter whether the expected length of a hash chain is 1 or 1,000: both are O(1).)
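To make that concrete, here is a toy chained hash set (a Python sketch for illustration only, not how any production hash table is written) that doubles its bucket count whenever the number of entries exceeds the number of buckets, so the expected chain length stays O(1):

```python
class ChainedHashSet:
    """Toy hash set with separate chaining and resizing."""

    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        chain = self._chain(key)
        if key in chain:
            return
        chain.append(key)                      # the actual key is stored, not just its hash
        self.count += 1
        if self.count > len(self.buckets):     # keep #buckets >= #entries ...
            keys = [k for b in self.buckets for k in b]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            for k in keys:                     # ... so chains stay O(1) long on average
                self._chain(k).append(k)

    def __contains__(self, key):
        return key in self._chain(key)         # scan one short chain
```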

But even if the expected hash-table lookup weren't O(1), there would still be a huge advantage to hashing over sorting: hashing is easily parallelizable. And that's something very important to Google, since only parallel algorithms can cope with Google-sized datasets.

In practice, a googly solution to this problem (I think: I haven't watched the video) would be to use two different hashes. The first hash assigns each number to a server, so it has a very large bucket size since each server has a lot of data. Each server then uses its own hash function to map its own data to its own buckets.

Now I can scan the entire dataset in parallel (using other servers), and for each entry, ask the appropriate storage server (which I work out using the primary hash) whether the required complement (a - A[i]) is in its dataset. Since every entry can only be stored on one server (or set of servers, if the data is replicated for reliability), I don't actually have to disturb the irrelevant servers. (In practice, I would take a bunch of queries, bucket them by server, and then, in parallel, send each server a list of queries, because that reduces connection setup time. But the principle is the same.)
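A single-process sketch of that two-level idea (the "servers" here are just in-memory sets, and both hash functions are placeholders rather than anything a real system would use):

```python
NUM_SERVERS = 4                                  # pretend shards

def primary_hash(x):
    return hash(x) % NUM_SERVERS                 # decides which "server" owns value x

shards = [set() for _ in range(NUM_SERVERS)]     # each server keeps its own hash set

def store(values):
    for x in values:
        shards[primary_hash(x)].add(x)

def find_pair_sharded(values, a):
    """For each x, query only the one shard that would hold a - x."""
    for x in values:
        y = a - x
        if y in shards[primary_hash(y)]:
            if y != x or values.count(x) > 1:    # don't pair an element with itself
                return x, y
    return None

data = [14, 7, -3, 10, 1]
store(data)
print(find_pair_sharded(data, 8))                # -> (7, 1)
```

In a real deployment the queries would be batched per server as described above, but the routing logic is the same.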

That's a very easy and almost infinitely scalable approach to the problem, and I think that an interviewer would be happy to hear about it. Parallel sorting is much more difficult, and in this case the complexity is totally unnecessary.

Of course, you may well have a good argument in favour of your own preferred strategy, and a good interviewer will be happy to hear a good argument as well, even if it's not one they thought of before. Good engineers are always open to discussing good ideas. And that discussion cannot start with the assumption that one of two different ideas must be "stupid".

rici
  • Another good solution is to have a hybrid strategy. If the length of the array is greater than x (where x is the target sum, i.e. A[i]+A[j] = x), we can create an array of size x and only look for numbers that are below x. Otherwise, you can use the interviewee's strategy. – invisal Mar 18 '17 at 02:20
  • @invisal: Does the problem specify that the entries are non-negative? – rici Mar 18 '17 at 02:22
  • So I rewatched the video. They said negatives can happen. So yeah, ignore my comment ^_^ – invisal Mar 18 '17 at 02:29
  • @invisal: No, it's a good thought (unless the specification were upfront) and it illustrates that finding the best solution depends on the precise specifications of a given problem. – rici Mar 18 '17 at 02:32
  • I see. So the notion that looking up whether you hashed something or not is fast comes from the idea that you can massively parallelize the algorithm? But wouldn't that be the same if you just go through the whole array and store the original values in different small buckets on different servers? I'm sorry, but I really do not understand the big advantage that hashing apparently has over storing the real data, if you need to be 100% sure that you don't run into collisions. – Sebastian Mueller Mar 18 '17 at 08:29
  • Also, if you want to parallelize the whole thing, I would assume that you need to address the places where the values are stored, and if you use O(n) many buckets, the address of each bucket should be of size O(log n), so just addressing the buckets for the query would also take log n time. And since we have to do that in each step, the running time should also be O(n log n). I seem to be confused with this whole concept.. – Sebastian Mueller Mar 18 '17 at 08:33
  • @Sebastian: I think you misunderstand the meaning of "hash set". A hash set does not store the hash of the entries. It uses the hash of each key to decide where to put that entry, but it stores the actual entry. I suggest you read the Wikipedia entry on "hash tables", which has a good description IIRC. – rici Mar 18 '17 at 14:02
  • @SebastianMueller you're right the cost of computing the hash would be O(log n) if you assume a bit cost model. The "time" in a hash table/set lookup is usually the number of comparisons, and not wall-time. Much of informal complexity theory (including that done in technical interviews) is not particularly technically sound, and you've found one of the inconsistencies. See: http://stackoverflow.com/questions/27127862/is-a-lookup-in-a-hash-table-o1 – Paul Hankin Mar 18 '17 at 14:21
  • @paul the size of a hash cannot usefully be much larger than the size of the hashed object; it is normally O(|k|) where |k| is the bitsize of a key. I emphasize the big-O because in many practical applications, the hash is asymptotically sub-linear (think of string keys, e.g.). To compute the hash, you need to look at every bit in the key, so that computation is Ω(|k|). That needs to be factored into the lookup time only if you include the key size as part of the problem size. And you should do that if the keys have significant size (again, strings)... – rici Mar 18 '17 at 14:43
  • ... The argument that the hash size needs to be log n in the number of entries is a red herring. In a hash set of 32-bit integers, you could add two billion billion entries, but most of them would be duplicates; what is relevant is the size of the key, not the quantity. Looked at that way, the time required to look up *q* bytes of key (in aggregate) is still on average O(*q*), or O(1)/byte. – rici Mar 18 '17 at 14:48
  • @rici First, I should say I think your answer is excellent and pragmatic, and have upvoted. If you consider a bit/byte cost model, and parameterise in key size as well as array size, then the hashtable and sorting work out the same (both run in kn time). It's just that it's hard to pin down the exact assumptions needed to say that hashset lookup is O(1) yet sorting is O(n log n) -- to get O(1) hashing one needs to consider log or uniform cost models, but both of those are weird (for example, allowing sorting in o(n log n) time). Maybe it's unnecessary nitpicking, but I think it helps the OP. – Paul Hankin Mar 18 '17 at 15:19
  • @paul I really think it is nitpicking for numbers, and I react accordingly :). But when it comes to strings, particularly longish strings, it is much more important. When I worked at G I was occasionally surprised by how ingrained the ideology of O(1) hashing is (even among my exceptionally bright ex-colleagues) when it is obviously dependent on the length of the string being hashed, and I think concentrating on the technicality of integer bit length somehow misses the point, which is quite practical.... – rici Mar 18 '17 at 16:38
  • ... when you are sorting strings, you also need to think about the length of the prefix you actually need to look at in a comparison, which gets steadily longer as the sort proceeds (prefix compression is possible, though). On the other hand, hashing *always* looks at the whole string, both to compute the hash and to verify the match (but not to reject collisions). So there are real pragmatic considerations independent of any theoretical issues, however fascinating those might be. – rici Mar 18 '17 at 16:45
  • @SebastianMueller: By the way, you asked "But wouldn't that be the same if you just go through the whole array and store the original values in different small buckets on different servers?" That's exactly what my answer suggests. And how do you know which server the original value has been stored on? My answer: use a hash. In effect, you have a gigantic networked hash table, where each server is a single hash bucket. The server probably uses its own hash table (with a different hash function) to store its own values. In short, a hash is a way to map a large set of values onto a small set of buckets. – rici Mar 18 '17 at 22:35
2

Since they assume n log n to be slow, we can assume that the hash set has fewer than log n many different entries

Wrong. The size of the hash table will be O(len(A)). This will not make the algorithm take more than linear expected time, since there is no multiplicative factor of the hash table size in the algorithm's runtime.

Also, while collisions are likely, hash tables are generally assumed to have some sort of collision resolution strategy. Collisions will not produce incorrect results.
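For example, with separate chaining, values that collide simply end up in the same chain and are told apart by ordinary equality comparisons, so lookups stay correct. A tiny Python sketch (the four-bucket table is deliberately undersized to force collisions):

```python
buckets = [[] for _ in range(4)]        # tiny table so collisions are guaranteed

def add(x):
    chain = buckets[hash(x) % 4]
    if x not in chain:                  # compares the actual values, not hashes
        chain.append(x)

def contains(x):
    return x in buckets[hash(x) % 4]    # scans the chain, comparing actual values

add(3); add(7)                          # collide in CPython: hash(3) % 4 == hash(7) % 4
print(contains(3), contains(7), contains(11))   # True True False
```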

user2357112