
I am not looking for an algorithm to the above question. I just want someone to comment on my answer.

I was asked the following question in an interview:

How to get top 100 numbers out of a large set of numbers (can't fit in memory)

And this is what I said:

Divide the numbers in batches of 1000 each. Sort each batch in "O(1)" time. Total time taken is O(n) up till now. Now take 1st 100 numbers from 1st and 2nd batch (in O(1)). Take 1st 100 from the above computed nos and the 3rd batch and so on. This will take O(n) in total - so it is an O(n) algorithm.
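In code, the scheme I described looks roughly like this (a sketch only; `batch_size` and `k` stand for the 1000 and 100 above, and the names are mine):

```python
def top_100_batched(stream, batch_size=1000, k=100):
    """Sort fixed-size batches, then repeatedly merge the running
    top-k with the next sorted batch, keeping only the k largest."""
    top = []    # running top-k, kept sorted in descending order
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            batch.sort(reverse=True)  # fixed-size batch: "O(1)" per my argument
            top = sorted(top + batch[:k], reverse=True)[:k]
            batch = []
    if batch:  # leftover partial batch
        batch.sort(reverse=True)
        top = sorted(top + batch[:k], reverse=True)[:k]
    return top
```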

The interviewer replied that sorting a batch of 1000 numbers won't take O(1) time, and neither will picking the first 100 out of a batch. After a lot of discussion he said he doesn't have a problem with the algorithm taking O(n) time; he just has a problem with me saying that sorting a batch takes O(1) time.

My explanation was that 1000 doesn't depend on the input size n. Irrespective of what n is, I'll always make batches of 1000 numbers, and if you have to calculate it, sorting a batch takes O(1000*log(1000)) time, which is essentially O(1).

If you have to make proper calculations, it would be

Sorting one batch takes 1000 * log(1000) steps. Sorting all (n/1000) such batches takes (n/1000) * 1000 * log(1000) = n * log(1000) = O(n) time.

I also asked a lot of my friends about this, and although they agreed with me, it was only partially. So I want to know if my reasoning is 100% accurate (please criticize even if it is 99% correct).

Just remember, this post is not asking for the answer to the above posted question. I have already found a better answer at Retrieving the top 100 numbers from one hundred million of numbers

  • I edited the question a bit, since you are not looking for an answer to the given question - the topic of the thread should reflect what you are asking. Please re-edit if you can think of a better title for the thread, since the one I've given seems not perfect. – amit Apr 16 '12 at 07:05
  • If the complete set of numbers can't be fitted in memory, how much memory is available to do the job? – Donotalo Apr 16 '12 at 07:15
  • @Donotalo: Note that the OP is not interested in solution to the problem, but in analyzing the suggested solution. – amit Apr 16 '12 at 07:23
  • Your interviewer is yet another techie who has "learned" big-O notation from reading blog posts. Even *chess* is solvable in O(1) time because of the finite (albeit large) move space. – Deestan Apr 16 '12 at 07:25
  • My definition of O(1) is that the solution doesn't depend on the input size. So even if we have to sort 10 numbers and we are taking time proportional to "10", then the time complexity will be O(n*log(n)). But if we are sorting n numbers (n may vary from 1 to infinity), but still time taken is, say 100 seconds (constant, not variable), I would say that is O(1) – Him Apr 16 '12 at 07:38
  • @Him: if you plug a concrete number in place of n in O(n^2), you'll get O(1). But I think the algorithm will still have O(n^2) complexity. – Donotalo Apr 16 '12 at 07:41
  • @Donotalo: if you want to determine the complexity of searching through an unsorted list of size n to determine if an element exists, the complexity is O(n). You cannot simply plug in a concrete number, because then you are altering the task. If you divide the original list into multiple lists of 100 items and then search through all of these lists, the complexity of searching through a list of size 100 is O(1). If you don't agree, study the definition of Big-O ([here](http://stackoverflow.com/q/487258/151344) e.g.). So the complexity of searching through all of the smaller lists is still O(n) – Alderath Apr 16 '12 at 13:50

2 Answers


It is indeed O(n) - but the constants are very high, especially considering you will need to read each element from the filesystem twice [once in the sort, and once in the second phase], and file system access is much slower than memory access. Since this will probably be the bottleneck of the algorithm, your solution will probably run about half as fast as one using a priority queue.
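For reference, the priority-queue approach reads each element from the slow source exactly once. A minimal sketch using Python's heapq as the priority queue (my illustration of the alternative, not the asker's code):

```python
import heapq

def top_k_heap(stream, k=100):
    """One-pass top-k: keep a min-heap of size k whose smallest
    element is the current threshold for entering the top k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop the smallest, push x
    return sorted(heap, reverse=True)
```

Each element costs at most O(log k) heap work, so the whole pass is O(n log k) = O(n) for a fixed k = 100, with only one read of each element.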

Note that for a constant top 100, even the naive solution is O(n):

solution = []
for i in range(100):       # 100 iterations in total
    x = max(numbers)       # one traversal to find the highest element
    numbers.remove(x)      # a second traversal to remove it from the list
    solution.append(x)

This solution is also O(n), since you have 100 iterations, and in each iteration you need 2 traversals of the list [with some optimisations, 1 traversal per iteration can be done]. So the total number of traversals is strictly smaller than 1000, and there are no more factors that depend on the size; thus the solution is O(n) - but it is definitely a terrible solution.

I think the interviewer meant that your solution, though O(n), has very large constants.

amit

The interviewer is wrong, but it's useful to consider why. What you're saying is correct, but there is an unstated assumption that you depend on. Possibly, the interviewer is making a different assumption.

If we say that sorting 1000 numbers is O(1), we're being a bit informal. Specifically, what we mean is that, in the limit as N goes to infinity, there is a constant greater than or equal to the cost of sorting the 1000 numbers. Since the cost of sorting the fixed-size set is independent of N, the limit isn't going to depend on N, either. Thus, it's O(1) as N goes to infinity.

A generous interpretation is that the interviewer wanted you to treat the sorting step differently. You could be more precise and say that it was O(M*log(M)) as M goes to infinity (or M goes to N, if you prefer), with M representing the size of the batches of numbers. That would make an overall O(N*log(M)) for your approach, as N and M both approach infinity. Of course, that wasn't the limit you described.
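To make the two limits concrete, here is a rough operation count (my own illustration, not part of the original discussion): with the batch size M held fixed, the cost is linear in N, whereas letting M grow with N brings back the log factor.

```python
import math

def batch_sort_comparisons(n, m):
    """Approximate comparisons to sort n numbers in batches of m:
    (n / m) batches, each costing about m * log2(m)."""
    return (n / m) * m * math.log2(m)

# With m fixed at 1000, doubling n exactly doubles the cost: O(n).
# With m = n (one giant batch), the cost is n * log2(n): O(n log n).
```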

Strictly speaking, it's meaningless to say that something is O(1) without specifying the limit. One usually doesn't need to bother for algorithms, because it's clear from the context: the limit commonly taken is as a single parameter approaches infinity. Your description is correct when considering only N, but you could consider more than just N.

Michael J. Barber