4

Problem statement: Find the 10 largest numbers in a file that contains billions of numbers.

Input: 97911 98855 12345 78982 ..... .....

I came up with the solution below, which has:

  • best case complexity O(n), when the file contains the numbers in descending order
  • worst case complexity O(n*10) ~ O(n), when the file contains the numbers in ascending order
  • average complexity ~ O(n)

Space complexity is O(1) in all cases

I read the file with a file reader and keep a sorted array that stores the 10 largest numbers seen so far. For each value I check whether it is greater than the smallest element in the array; if so, I insert it into its correct position by swapping.

import java.io.File;
import java.io.FileReader;
import java.util.Scanner;

Scanner sc = new Scanner(new FileReader(new File("demo.txt")));
int[] maxNum = new int[10];          // kept in descending order; maxNum[9] is the smallest of the top 10
                                     // (assumes non-negative input; otherwise fill with Integer.MIN_VALUE)
while (sc.hasNextInt()) {            // hasNextInt/nextInt also copes with space-separated input
    int number = sc.nextInt();
    if (number > maxNum[9]) {
        maxNum[9] = number;          // replace the current smallest
        // bubble the new value up to its correct position
        for (int i = 9; i > 0 && maxNum[i] > maxNum[i - 1]; i--) {
            int temp = maxNum[i];
            maxNum[i] = maxNum[i - 1];
            maxNum[i - 1] = temp;
        }
    }
}
sc.close();

I am looking for feedback on whether there are better ways to implement this.

progyammer
AdityaReddy
    FYR, `O(n*10)` is the same as `O(n)`. – DYZ Jan 15 '17 at 07:39
  • You can use the built-in methods to find the max value: whenever you find the maximum value, store it, then remove it, and repeat 10 times. – Null Jan 15 '17 at 07:42
  • @Null . .Which built in method do you suggest .. will it not need multiple passes and more iterations – AdityaReddy Jan 15 '17 at 07:43
  • check [this](http://stackoverflow.com/a/12054461/4834682) , it will help you. – Null Jan 15 '17 at 07:43
  • Is there an upper bound on the values? – Jordan Running Jan 15 '17 at 07:58
  • @Jordan Its assumed that they would fall in the Integer range .. `value < Integer.MAX_VALUE` – AdityaReddy Jan 15 '17 at 07:59
  • I'm voting to close this question as off-topic because this question is more appropriate on [CodeReview.SE]. – Code-Apprentice Jan 15 '17 at 08:11
  • The worst case is not when the file is reverse-sorted. It is when it is randomly ordered, in which case you have to sort, which at a minimum is *O(log(N))*. – user207421 Jan 15 '17 at 08:53
  • @EJP ... When its in ascending order .. its the worst case right .. as there are 10 swaps involved for each new number ... if its reverse sorted .. then there is no swap required at all ... And I guess there is no need to sort all the billion numbers as we just need top 10 – AdityaReddy Jan 15 '17 at 08:55
  • I downvote it as you weren't after a good answer but were only looking for support and encouragement for your code. – Saeed Amiri Jan 18 '17 at 08:14
  • Whatever makes you happy dude .. @SaeedAmiri – AdityaReddy Jan 18 '17 at 08:19
  • @AdityaReddy, it is not about being happy; it is about asking a real problem. After my previous comment I waited for a while to be sure you saw it. But if you saw it and you are only after happiness (e.g. people saying you are doing well while it is not true), then I think this is not the right place. – Saeed Amiri Jan 18 '17 at 08:25
  • For this specific question, the idea was not to use a max heap but to use multithreading. If you don't know how to use multithreading over one file, you can test the difference e.g. over multiple files to understand the main point. If you don't know multithreading and you don't want to learn it, don't ask a question. – Saeed Amiri Jan 18 '17 at 08:53

3 Answers

4

If the file is not sorted, you have to look at least once at every number in the file because it could be among the 10 largest. Therefore O(n) is the best you can achieve.

Some optimization is possible (however without changing the asymptotic complexity) by replacing the maxNum array with a min-heap. This will run faster if the count of numbers to be found is large enough (say you are looking for the 100 largest numbers). It will probably not yet pay off at 10.
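
For illustration, a minimal sketch of that idea using java.util.PriorityQueue (a min-heap by default); the file name "demo.txt" and the constant K are carried over from the question, and the class name is invented for this example:

import java.io.File;
import java.io.FileReader;
import java.util.PriorityQueue;
import java.util.Scanner;

public class Top10WithMinHeap {
    public static void main(String[] args) throws Exception {
        final int K = 10;
        // The smallest of the K values kept so far is always at the head.
        PriorityQueue<Integer> heap = new PriorityQueue<>(K);
        try (Scanner sc = new Scanner(new FileReader(new File("demo.txt")))) {
            while (sc.hasNextInt()) {
                int value = sc.nextInt();
                if (heap.size() < K) {
                    heap.offer(value);            // fill the heap first
                } else if (value > heap.peek()) {
                    heap.poll();                  // drop the current smallest
                    heap.offer(value);            // keep only the K largest
                }
            }
        }
        System.out.println(heap);                 // the 10 largest, in no particular order
    }
}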

Henry
  • Yup, that's true if more max numbers are needed. But for 10 numbers, as you stated, arrays would be much faster. – AdityaReddy Jan 15 '17 at 07:53
  • I don't see any reason not to use a `min heap` here; this implementation has *way* more cost from swapping operations. There's no need to keep the elements in sorted order; just do that *once* when polling from the top of the heap. – Allen Jan 15 '17 at 08:14
  • @Xlee . .Sure . .will run some baseline tests and see the difference – AdityaReddy Jan 15 '17 at 09:00
  • This answer just says what OP is doing is fine, while it is not. In best scenario it can be a comment. In fact this answer should be deleted. – Saeed Amiri Jan 18 '17 at 08:09
  • @SaeedAmiri Obviously, I am of a different opinion. For a single threaded solution the OP was close to optimum. Using parallel processing as you suggest boils down to "buy a more powerful machine" and is faster just for this reason. – Henry Jan 18 '17 at 11:39
  • @Henry, I think even your PC, even your mobile phone, has a multiprocessor; on any simple PC supporting only two parallel threads, my approach solves the problem almost twice as fast as yours. It is not faster because of buying anything extra; it is faster because it uses the current hardware and environment more wisely, instead of following old-fashioned textbooks written in the time of punch cards. Certainly they are good sometimes, but you should not repeat something like a recorded stream from 1970. – Saeed Amiri Jan 18 '17 at 11:43
  • @SaeedAmiri The limiting factor here is most likely I/O and not processing power, but I did not do any measurements. If this is the case, parallel processing will not bring that much. – Henry Jan 18 '17 at 12:35
  • @Henry, One can read the file in parallel. I mean, e.g., read half of the file with one thread and the other half with another thread; even IO these days provides some support for this. I don't know much about Java, but in C# parallelization is easy and one can simply test it. I even found a relevant question here: http://stackoverflow.com/questions/17188357/read-large-txt-file-multithreaded. People in that post tested the multithreading approach; it was about two times faster than the single-threaded one. BTW I don't know the underlying technology which IO uses to support this. It is maybe not just IO. – Saeed Amiri Jan 18 '17 at 12:43
2

You can improve the algorithm with multithreading and parallelization. That means running, say, 20 threads, partitioning the file into 20 parts, and finding the largest 10 numbers in each part. At the end, find the largest 10 numbers among those 20 arrays (each of length 10) that you maintained.

The point is that the operation is reading from the file or database, not writing, so it should be possible to access different parts of the file from different threads in parallel. Even if your input were in memory, this would be faster than a naive search. It is still O(n), but with t threads operating in parallel each thread performs only about n/t comparisons, which makes it roughly t times faster than the single-threaded algorithm.
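
As a rough sketch of what this could look like in Java (the part file names part0.txt … part3.txt, the thread count, and the class name are invented for the example; it assumes the input has already been split into separate part files, one number per line):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTop10 {

    // Finds the 10 largest numbers in one part file (one number per line).
    static int[] top10Of(File file) throws IOException {
        int[] max = new int[10];
        Arrays.fill(max, Integer.MIN_VALUE);
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                insert(max, Integer.parseInt(line.trim()));
            }
        }
        return max;
    }

    // Same insertion step as in the question: max[] is kept sorted descending.
    static void insert(int[] max, int value) {
        if (value <= max[9]) return;
        max[9] = value;
        for (int i = 9; i > 0 && max[i] > max[i - 1]; i--) {
            int tmp = max[i]; max[i] = max[i - 1]; max[i - 1] = tmp;
        }
    }

    public static void main(String[] args) throws Exception {
        // Invented part file names; in practice, one per partition of the input.
        File[] parts = { new File("part0.txt"), new File("part1.txt"),
                         new File("part2.txt"), new File("part3.txt") };
        ExecutorService pool = Executors.newFixedThreadPool(parts.length);
        List<Future<int[]>> results = new ArrayList<>();
        for (File part : parts) {
            Callable<int[]> task = () -> top10Of(part);
            results.add(pool.submit(task));       // one local top-10 per part
        }
        int[] global = new int[10];
        Arrays.fill(global, Integer.MIN_VALUE);
        for (Future<int[]> f : results) {
            for (int v : f.get()) {
                insert(global, v);                // merge the local results
            }
        }
        pool.shutdown();
        System.out.println(Arrays.toString(global));
    }
}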

Finally, I should say that micro-optimizing the small array is pointless; the main cost, and the main point, is how to process a big file, not how to maintain a small array.

Saeed Amiri
1

In general, to find the K largest numbers from N numbers:

  1. Sort the numbers in O(N lg N) time and then take the K largest. If you have billions of numbers on disk, you will have to do external (on-disk) sorting, such as external MergeSort.

  2. Use a Min-Heap of capacity K and scan through the N values. Keep the K largest values in the heap, of which the smallest of those values is at the top. Running time: O(N lg K). You can keep the Min-heap in memory as you scan through the numbers from disk.

  3. Use a selection algorithm to find the Kth largest value (equivalently, the (N-K)th smallest) in expected O(N) time. The Quickselect algorithm, which uses Quicksort's partition step, also partitions the values so that the K largest end up on one side of that element. Expected running time: O(N). However, this selection algorithm works in memory; a sketch follows after this list.
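
As an illustration of option 3, a minimal in-memory Quickselect sketch (the class and method names and the toy input array are invented for this example; a real use would first load the data into memory):

import java.util.Arrays;
import java.util.Random;

public class QuickSelectTopK {
    private static final Random RND = new Random();

    // Rearranges a[] so that the element with (0-based) index k in sorted
    // order is at position k; smaller values end up left of it, larger right.
    static void select(int[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) return;
            if (p < k) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition around a randomly chosen pivot.
    static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        int[] values = { 97911, 98855, 12345, 78982, 55555, 1, 42, 99999, 7, 314, 2718, 161803 };
        int k = 10;
        select(values, values.length - k);   // boundary element of the top K
        int[] topK = Arrays.copyOfRange(values, values.length - k, values.length);
        System.out.println(Arrays.toString(topK));  // the K largest, unordered
    }
}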

stackoverflowuser2010
  • The answer is missing the main thing about dealing with big files and instead provides only general, Wikipedia-like information. – Saeed Amiri Jan 18 '17 at 08:12
  • @SaeedAmiri: In each of the three points, I clearly mention how the algorithms can be applied to on-disk big data. – stackoverflowuser2010 Jan 18 '17 at 18:37
  • I mean the main point is to do it in parallel, not just the normal sequential way. – Saeed Amiri Jan 18 '17 at 20:56
  • @SaeedAmiri: Where in the OP's post does he say that he has multiple computers for parallelism? Parallelism should not be first concern. Creating algorithms where data cannot fit into memory is more important. – stackoverflowuser2010 Jan 18 '17 at 21:51
  • OP doesn't need to say that, you should wake up, it is not 1970, even mobile phones have multiprocessors. You just repeated 1970 textbooks. Read my comments on the other answer. – Saeed Amiri Jan 19 '17 at 09:04
  • @SaeedAmiri: Apparently you are unable to read English appropriately. If the OP did not ask about parallelism, then you should not assume it. The main issue with OP's problem is on-disk versus in-memory data. – stackoverflowuser2010 Apr 07 '17 at 06:34
  • Your comment is too low level and irrelevant. You suggest never thinking outside the box. BTW, as I said, parallelism is one of the things people use very often in these cases. You should wake up. – Saeed Amiri Apr 07 '17 at 20:09