37

What would be the best solution to find top N (say 10) elements in an unordered list (of say 100).

The solution which came in my head was to 1. sort it using quick sort, 2. get top 10.

But is there any better alternative?

Yin Zhu
  • 16,980
  • 13
  • 75
  • 117
zengr
  • 38,346
  • 37
  • 130
  • 192

13 Answers13

31

The time could be reduced to linear time:

  1. Use the selection algorithm, which effectively find the k-th element in a un-sorted array in linear time. You can either use a variant of quick sort or more robust algorithms.

  2. Get the top k using the pivot got in step 1.

Yin Zhu
  • 16,980
  • 13
  • 75
  • 117
  • 2
    This is a very nice way. But I'll mention that linear (deterministic) selection is actually pretty slow -- using quickselect with randomly chosen pivots is likely to be much faster on typical problem sizes, but it can take quadratic time. – j_random_hacker Nov 03 '10 at 07:10
10

How about delegating everything to Java ;)

function findTopN(Array list, int n)
{
    Set sortedSet<Integer> = new TreeSet<>(Comparators.naturalOrder());

    // add all elements from list to sortedSet

    // return the first n from sortedSet
}

I am not trying to say that this is the best way. I still think Yin Zhu's method of finding the kth largest element is the best answer.

nuaavee
  • 1,336
  • 2
  • 16
  • 31
  • 11
    TreeSet will remove duplicates which may not be desired. – Stefan Jan 18 '13 at 16:14
  • 1
    A simple and clean solution, but this will be O(nlogn) average/worst case compared to a selection algorithm (such as quickselect) which can select the top k elements in O(n) average case and o time where n is the size of the input array. – Pavel Feb 07 '14 at 01:22
9

If you're dealing with simple elements like fixed-length integers, then provided you can spare a memory buffer of the same size as the input data, sorting can be done in O(n) time using bucket or radix sorts, and this will be the fastest.

Although there are linear-time selection algorithms, the hidden constant is very high -- around 24. That means an O(nlog n) algorithm will be typically faster for fewer than several million elements.

Otherwise, in the general case when you can only compare 2 elements and determine which is greater, the problem is best solved by a heap data structure.

Suppose you want the top k of n items. All solutions based on fully sorting the data require O(nlog n) time, while using a heap requires only O(nlog k) time -- just build a heap on the first k elements, then keep adding an element and removing the maximum. This will leave you with a heap containing the smallest k elements.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
4

Yes, you can do it in O(n) by just keeping a (sorted) running list of the top N. You can sort the running list using the regular library functions or a sorting network. E.g. a simple demo using 3, and showing which elements in the running list change each iteration.

5 2 8 7 9

i = 0
top[0] <= 5

i = 1
top[1] <= 2

i = 2
top[2] <= top[1] (2)
top[1] <= top[0] (5)
top[0] <= 8

i = 3
top[2] <= top[1] (5)
top[1] <= 7

i = 4
top[2] <= top[1] (7)
top[1] <= top[0] (8)
top[0] <= 9
Matthew Flaschen
  • 278,309
  • 50
  • 514
  • 539
  • 1
    This is good if the number of selected items is small. For selecting something like 10000 top elements from a million this is no longer optimal. – Rafał Dowgird Nov 03 '10 at 08:22
  • This is somehow like [Insertion sort](http://en.wikipedia.org/wiki/Insertion_sort). – Gumbo Nov 03 '10 at 08:37
4

The best solution is to use whatever facilities your chosen language provides which will make your life easier.

However, assuming this was a question more related to what algorithm you should choose, I'm going to suggest a different approach here. If you're talking about 10 from 100, you shouldn't generally worry too much about performance unless you want to do it many times per second.

For example, this C code (which is about as inefficient as I can make it without being silly) still takes well under a tenth of a second to execute. That's not enough time for me to even think about going to get a coffee.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SRCSZ 100
#define DSTSZ 10

int main (void) {
    int unused[SRCSZ], source[SRCSZ], dest[DSTSZ], i, j, pos;

    srand (time (NULL));
    for (i = 0; i < SRCSZ; i++) {
        unused[i] = 1;
        source[i] = rand() % 1000;
    }

    for (i = 0; i < DSTSZ; i++) {
        pos = -1;
        for (j = 0; j < SRCSZ; j++) {
            if (pos == -1) {
                if (unused[j]) {
                    pos = j;
                }
            } else {
                if (unused[j] && (source[j] > source[pos])) {
                    pos = j;
                }
            }
        }
        dest[i] = source[pos];
        unused[pos] = 0;
    }

    printf ("Source:");
    for (i = 0; i < SRCSZ; i++) printf (" %d", source[i]);
    printf ("\nDest:");
    for (i = 0; i < DSTSZ; i++) printf (" %d", dest[i]);
    printf ("\n");

    return 0;
}

Running it through time gives you (I've formatted the output a bit to make it readable, but haven't affected the results):

Source: 403 459 646 467 120 346 430 247 68 312 701 304 707 443
        753 433 986 921 513 634 861 741 482 794 679 409 145 93
        512 947 19 9 385 208 795 742 851 638 924 637 638 141
        382 89 998 713 210 732 784 67 273 628 187 902 42 25
        747 471 686 504 255 74 638 610 227 892 156 86 48 133
        63 234 639 899 815 986 750 177 413 581 899 494 292 359
        60 106 944 926 257 370 310 726 393 800 986 827 856 835
        66 183 901
Dest: 998 986 986 986 947 944 926 924 921 902

real    0m0.063s
user    0m0.046s
sys     0m0.031s

Only once the quantities of numbers become large should you usually worry. Don't get me wrong, I'm not saying you shouldn't think about performance. What you shouldn't do is spend too much time optimising things that don't matter - YAGNI and all that jazz.

As with all optimisation questions, measure don't guess!

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
  • but is it a suitable interview answer?? – zengr Nov 03 '10 at 06:32
  • 1
    Yes, I believe it would be. By far the most dangerous person to hire is the over-engineer. This is someone who, when asked to provide a function to calculate the sum of a list of items, will give you a 900-line monstrosity that's configurable to handle integers, floats, complex numbers and can also perform pre-calculations on each item with a callback function. This is _not_ the sort of person you want working on something that has a definite delivery date :-) In addition, it gives valuable insight that the candidate can think for themself. Any monkey can cut code. – paxdiablo Nov 03 '10 at 06:37
  • 2
    I would hire someone like that on the spot, provided they explain why they did it that way and what are the consequences. See also http://stackoverflow.com/questions/903572/consequences-of-doing-good-enough-software/903609#903609 and http://stackoverflow.com/questions/445425/what-algorithms-should-every-developer-know/445554#445554 although there were ... shall we say, animated ... discussions on the merits :-) – paxdiablo Nov 03 '10 at 06:40
  • 1
    Hehe, +1 for practicality... But re: what kind of answer is "best" I think the crucial thing is for the interviewee to ask: (1) What are the typical problem sizes? (2) How fast does it need to be? (And BTW if performance on big problems isn't important then the best answer is "Use whatever built-in `sort()` the language has :-P) – j_random_hacker Nov 03 '10 at 07:17
3

You can use List and can guava's Comparators class to get the desired results. It is a highly optimized solution. Please see a sample below, which gets top 5 numbers. Api can be found here.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collector;

import org.junit.Test;

import com.google.common.collect.Comparators;
import com.google.common.collect.Lists;

public class TestComparator {

    @Test
    public void testTopN() {
        final List<Integer> numbers = Lists.newArrayList(1, 3, 8, 2, 6, 4, 7, 5, 9, 0);
        final Collector<Integer, ?, List<Integer>> collector = Comparators.greatest(5,
                Comparator.<Integer>naturalOrder());
        final List<Integer> top = numbers.stream().collect(collector);
        System.out.println(top);
    }

}

Output: [9, 8, 7, 6, 5]

Pritesh Mhatre
  • 3,847
  • 2
  • 23
  • 27
2

Well, you can create a heap from an unsorted array in O(n) time, and you can get the top element from the heap in O(log(n)) time. So your total runtime is O(n + k*log(n)).

Charles Munger
  • 1,417
  • 12
  • 11
1

Written below both selection sort and insertion sort implementations. For larger data set I suggest insetion sort better than selection sort

public interface FindTopValues
{
  int[] findTopNValues(int[] data, int n);
}

Insertion Sort Implementation:

public class FindTopValuesInsertionSortImpl implements FindTopValues {  

/**
 * Finds list of the highest 'n' values in the source list, ordered naturally, 
 * with the highest value at the start of the array and returns it 
 */
@Override
public int[] findTopNValues(int[] values, int n) {

    int length = values.length;
    for (int i=1; i<length; i++) {
        int curPos = i;
        while ((curPos > 0) && (values[i] > values[curPos-1])) {
            curPos--;
        }

        if (curPos != i) {
            int element = values[i];
            System.arraycopy(values, curPos, values, curPos+1, (i-curPos));
            values[curPos] = element;
        }
    }       

    return Arrays.copyOf(values, n);        
}   

}

Selection Sort Implementation:

public class FindTopValuesSelectionSortImpl implements FindTopValues {

/**
 * Finds list of the highest 'n' values in the source list, ordered naturally, 
 * with the highest value at the start of the array and returns it 
 */
@Override
public int[] findTopNValues(int[] values, int n) {
    int length = values.length;

    for (int i=0; i<=n; i++) {
        int maxPos = i;
        for (int j=i+1; j<length; j++) {
            if (values[j] > values[maxPos]) {
                maxPos = j;
            }
        }

        if (maxPos != i) {
            int maxValue = values[maxPos];
            values[maxPos] = values[i];
            values[i] = maxValue;
        }           
    }
    return Arrays.copyOf(values, n);        
}
}
Ashish Chopra
  • 1,413
  • 9
  • 23
0

I was asked for the same algorithm on the interview. I done that, if somebody can compare that with fastest algorithm in Java - will be very useful.

    public int[] findTopNValues(int[] anyOldOrderValues, int n) {
        if (n < 0) {
            return new int[]{};
        }
        if (n == 1) {
            return new int[]{findMaxValue(anyOldOrderValues)};
        }

        int[] result = new int[n + 1];
        for (int i = 0; i < Math.min(n, anyOldOrderValues.length); i++) {
            result[i] = anyOldOrderValues[i];
        }
        Arrays.sort(result);

        int max = result[0];
        for (int i = n - 1; i < anyOldOrderValues.length; i++) {
            int value = anyOldOrderValues[i];
            if (max < value) {
                result[n] = value;
                Arrays.sort(result);
                int[] result1 = new int[n + 1];
                System.arraycopy(result, 1, result1, 0, n);
                result = result1;
                max = result[0];
            }
        }
        return convertAndFlip(result, n);
    }

    public static int[] convertAndFlip(int[] integers, int n) {
        int[] result = new int[n];
        int j = 0;
        for (int i = n - 1; i > -1; i--) {
            result[j++] = integers[i];
        }
        return result;
    }

and test for that:

public void testFindTopNValues() throws Exception {
    final int N = 100000000;
    final int MAX_VALUE = 100000000;
    final int returnArray = 1000;
    final int repeatTimes = 5;

    FindTopValuesArraySorting arraySorting = new FindTopValuesArraySorting();

    int[] randomArray = createRandomArray(N, MAX_VALUE);
    for (int i = 0; i < repeatTimes; i++) {

        long start = System.currentTimeMillis();
        int[] topNValues = arraySorting.findTopNValues(randomArray, returnArray);
        long stop = System.currentTimeMillis();

        System.out.println("findTopNValues() from " + N + " elements, where MAX value=" + (MAX_VALUE - 1) + " and return array size " + returnArray + " elements : " + (stop - start) + "msec");
        // System.out.println("Result list = " + Arrays.toString(topNValues));
    }
}

private static int[] createRandomArray(int n, int maxValue) {
    Random r = new Random();
    int[] arr = new int[n];
    for (int i = 0; i < n; i++) {
        arr[i] = r.nextInt(maxValue);
    }
    return arr;
}

Result is something like:

findTopNValues() from 100000000 elements, where MAX value=99999999 and return array size 1000 elements : 395msec
findTopNValues() from 100000000 elements, where MAX value=99999999 and return array size 1000 elements : 311msec
findTopNValues() from 100000000 elements, where MAX value=99999999 and return array size 1000 elements : 473msec
findTopNValues() from 100000000 elements, where MAX value=99999999 and return array size 1000 elements : 380msec
findTopNValues() from 100000000 elements, where MAX value=99999999 and return array size 1000 elements : 406msec

~400msc average result, for getting 1000 max integers from array of 100.000.000 initial elements. not bad!

Just tried that set from above:

findTopNValues() from 101 elements and return array size 10 elements : 1msec
Result list = [998, 986, 986, 986, 947, 944, 926, 924, 921, 902]
Original list = [403, 459, 646, 467, 120, 346, 430, 247, 68, 312, 701, 304, 707, 443, 753, 433, 986, 921, 513, 634, 861, 741, 482, 794, 679, 409, 145, 93, 512, 947, 19, 9, 385, 208, 795, 742, 851, 638, 924, 637, 638, 141, 382, 89, 998, 713, 210, 732, 784, 67, 273, 628, 187, 902, 42, 25, 747, 471, 686, 504, 255, 74, 638, 610, 227, 892, 156, 86, 48, 133, 63, 234, 639, 899, 815, 986, 750, 177, 413, 581, 899, 494, 292, 359, 60, 106, 944, 926, 257, 370, 310, 726, 393, 800, 986, 827, 856, 835, 66, 183, 901]
Dmitri Algazin
  • 3,332
  • 27
  • 30
0

The best Algorithm would by large depend on the size of K. If K is small then by simply following BubbleSort Algorithm and iterating the outer loop K times would give the top K values. The complexity will be O(n*k).

However for values of K close to n the complexity will approach O(n^2). In such scenario quicksort might be a good alternative.

  • BubbleSort itself has O(nlogn). So the complexity will not be O(n*k). more orders don't depend on the size of the factors i.e n and k here! – Anand Mattikopp Apr 16 '20 at 11:32
0
public class FindTopValuesSelectionSortImpl implements FindTopValues {

/**
 * Finds list of the highest 'n' values in the source list, ordered naturally, 
 * with the highest value at the start of the array and returns it 
 */
@Override
public int[] findTopNValues(int[] values, int n) {
    int length = values.length;

    for (int i=0; i<=n; i++) {
        int maxPos = i;
        for (int j=i+1; j<length; j++) {
            if (values[j] > values[maxPos]) {
                maxPos = j;
            }
        }

        if (maxPos != i) {
            int maxValue = values[maxPos];
            values[maxPos] = values[i];**strong text**
            values[i] = maxValue;
        }           
    }
    return Arrays.copyOf(values, n);        
}
}
HARISH
  • 1
  • 3
0

Yes there is a way to do better than quicksort. As pointed by Yin Zhu, you can search for kth largest element first and then use that element value as your pivot to split the array

CodeKata
  • 572
  • 4
  • 8
0

find-top-n-elements-in-an-array can be solved with below technique Let's say array length is m

  1. Using 2 loops like Bubble sort - O(m^2) 2 loops

  2. Finding pivot at Nth position( Quick sort) -- finding pivot at nth location but worst case complexity is O(MLogM) and might lead to O(M^2)

  3. Heap - Heap is very useful data-structure for such requirement like getKthMax, getKthMin, getTopN, getBottomN and so on.. Heap can be max heap or min heap and as per requirement one of them can be used.

In current scenario MinHeap makes more sense as minimum number will be always on top, with below steps to solve

  • Iterate over elements in the array
  • add element in heap
  • if heap size > n then pop one element from heap so any time heap will have at most n elements in the heap
  • Iterate over heap, and collect all element in collection

Time complexity: m - size of array, n is top n element O(MlogN) -- heap add and remove takes logN time and we are doing this for all element in the array Space complexity O(N)

// Sample Java code

public List<Integer> getTopNElements(int[] arr, int n){
   List<Integer> topNList = new ArrayList<>();
   if(arr==null || arr.length <1 || n<1) return topNList;
   
   PriorityQueue<Integer> heap = new PriorityQueue<>(); // default MinHeap
   for(int elem: arr){
      heap.offer(elem);
      if(heap.size() >n) heap.poll();
   }

   while(!heap.isEmpty()){
      topNList.add(heap.poll());
   }
   return topNList;
}

Hope this helps.

Sagar balai
  • 479
  • 6
  • 13