So, I was watching one of these Google videos on how they conduct interviews (https://www.youtube.com/watch?v=XKu_SEDAykw), and I found their solution quite strange. Since a lot of clever people work at Google, I am now wondering whether I have misunderstood something or whether they have. Let me sum up the task and their solution in case you don't want to watch the video:
The task is to give an efficient algorithm for the following problem:
Given an array A of integers and a separate integer a, find two indices i, j such that A[i] + A[j] = a.
They start off with the array being sorted and produce a nice linear-time algorithm, which as far as I can tell is the standard two-pointer scan.
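For concreteness, here is a minimal Python sketch of that sorted case as I understood it; the code and names are mine, not taken from the video:

```python
def two_sum_sorted(A, a):
    """Sorted A: move two pointers inward from the ends; return a pair
    of indices (i, j) with A[i] + A[j] == a, or None if none exists."""
    i, j = 0, len(A) - 1
    while i < j:
        s = A[i] + A[j]
        if s == a:
            return i, j
        if s < a:
            i += 1   # sum too small: advance the left pointer
        else:
            j -= 1   # sum too large: retreat the right pointer
    return None
```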
But then the interviewer asks what happens if the array isn't sorted, and they propose the following linear-time algorithm (they say that first sorting the array and then using their linear-time algorithm is too slow, although it would run in O(n log n) time): they go through the array from first to last and use a hash set to store the numbers they have already seen. For every index of the array they then only need to check against the hash set (i.e., did I already see the number I need to complete the sum?), and since that lookup is apparently possible in constant time, the whole algorithm runs in linear time (essentially the cost of one hash-set operation times Array.length).
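In code, their proposal amounts to something like the following sketch (again my own Python, not code from the video; I use a dict rather than a plain set so that the two indices can be returned):

```python
def two_sum_unsorted(A, a):
    """One pass with a hash table mapping value -> index; return a pair
    of indices (i, j) with A[i] + A[j] == a, or None if none exists."""
    seen = {}                      # numbers seen so far, keyed to their index
    for j, x in enumerate(A):
        if a - x in seen:          # the supposedly O(1) lookup
            return seen[a - x], j
        seen[x] = j                # remember this number for later elements
    return None
```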
Now to my criticism: I think there is a huge flaw in this solution, and it lies in the possibility of collisions. Since they consider O(n log n) to be too slow, we can assume that the hash set has fewer than log n many different buckets. But then, for any large input, the probability of getting a collision tends to 1 when hashing n numbers into at most log n buckets. So they trade a very modest speed increase for an almost certainly erroneous algorithm: they assume that ten billion is a large size for the array, but the base-2 logarithm of ten billion is still only about 33, and matching that speed with the hash-set algorithm would mean hashing over 300 million numbers to the same bucket.
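To spell that arithmetic out (under my assumption above that the hash set has at most log n buckets):

$$n = 10^{10}, \qquad \log_2 n \approx 33, \qquad \frac{n}{\log_2 n} \approx 3 \times 10^8,$$

i.e. over 300 million numbers per bucket.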
Either I misunderstand something about hashing or this is an awful solution to the problem. Again, the safe O(n log n) algorithm is not much slower than the one they give, unless the array gets so big that the hash set is certain to produce collisions.
I wouldn't be surprised if a constant-time algorithm that flips a coin for small arrays and always answers "yes" for large arrays had, on average, the same rate of success as their hash-set solution.
If I misunderstand something about hashing, please point it out, because I find it rather hard to believe that a top-notch computer engineering company would make such an error.