I am not looking for an algorithm to the above question. I just want someone to comment on my answer.
I was asked the following question in an interview:
How to get top 100 numbers out of a large set of numbers (can't fit in memory)
And this is what I said:
Divide the numbers in batches of 1000 each. Sort each batch in "O(1)" time. Total time taken is O(n) up till now. Now take 1st 100 numbers from 1st and 2nd batch (in O(1)). Take 1st 100 from the above computed nos and the 3rd batch and so on. This will take O(n) in total - so it is an O(n) algorithm.
The interviewer replies that sorting a batch of 1000 nos. won't take O(1) time and so won't picking out 1st 100 out of a batch and after a lot of discussion he said, he doesn't have problem with the algo taking O(n) time, he just has a problem with me saying that sorting the batch takes O(1) time.
My explanation was that 1000 doesn't depend on the input (n). Irrespective of what n is, I'll always make batches of 1000 nos. and if you have to calculate, the sorting takes O(1000*log 1000)) which is essentially O(1).
If you have to make proper calculations, it would be
1000*log 1000 to sort one batch sort (n/1000) such batches takes 1000 * log 1000 * n/1000 = O(n*log(1000)) time = O(n) time
I asked a lot of my friends also about this and although they agreed with me but partially. So I wan't to know if my reasoning is 100% accurate (please criticize even if it is 99% correct).
Just remember, this post is not asking for the answer to the above posted question. I have already found a better answer at Retrieving the top 100 numbers from one hundred million of numbers