
I need to find the N largest elements in a big collection of data.

I have:

  • A collection of hundreds of millions of items in an external database (Cassandra)
  • A job that iterates through these items and finds the item with the largest value

    Item largest = null;
    
    // Page through big data
    List<Item> items = getNextPage(pageSize);
    while (items.size() > 0) {
    
      // Update largest item based on values from current page
      for (Item current : items) {
        if (largest == null || largest.getValue() < current.getValue()) {
          largest = current;
        }
      }
    
      // Move to next page
      items = getNextPage(pageSize);
    }
    

I need:

  • Extend this job to hold the N (let's say 100) elements with the highest values

My approach:

  • I was thinking about something like a priority queue with a fixed size

    class PQsort implements Comparator<Item> {
    
      public int compare(Item one, Item two) {
        // Descending by value; Integer.compare avoids int overflow
        return Integer.compare(two.getValue(), one.getValue());
      }
    }
    
    PriorityQueue<Item> pq = new PriorityQueue<Item>(101, new PQsort());
    
    ...while...for...
    
    pq.offer(current);
    if (pq.size() == 101) {
      // Remove the tail somehow 
    }
    ...
    
  • Removing the tail: see Removing tail element of priority queue

What is the optimal solution for this task?

Michal
  • Why don't you let the database do the work? – Gyro Gearless Mar 14 '17 at 13:16
  • @Gyro Gearless the specified value (its corresponding column) is not a clustering key, thus I am not able to sort by it – Michal Mar 14 '17 at 13:20
  • @Michal, are you sure sorting it is the answer? Maybe you could simply select all the rows that have the max value `SELECT * FROM table WHERE col = (SELECT MAX(col) FROM table)` and then count them – cristianhh Mar 14 '17 at 13:22
  • @cristianhh Cassandra does not support nested queries – Michal Mar 14 '17 at 13:25
  • @Michal understood, in that case see if you can find any help here: http://stackoverflow.com/questions/17342176/max-distinct-and-group-by-in-cassandra – cristianhh Mar 14 '17 at 13:29

5 Answers


A couple of thoughts on this.

This task is well suited for using multiple processors. You could split the pages across a pool of threads and then combine the results as they complete.
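For illustration, here is a minimal sketch of that idea; `fetchPage`, the page count, and the thread count are hypothetical stand-ins for the OP's `getNextPage` setup. Each task keeps its own bounded min-heap of 100 items, and the small partial heaps are merged at the end.

    import java.util.*;
    import java.util.concurrent.*;

    class ParallelTopN {

        // Hypothetical page fetcher standing in for the OP's getNextPage(pageSize)
        static int[] fetchPage(int page) {
            return new Random(page).ints(1_000, 0, 1_000_000).toArray();
        }

        // Offer an item to a bounded min-heap holding the n largest values so far
        static void offer(Queue<Integer> heap, int n, int item) {
            if (heap.size() < n || heap.peek() < item) {
                if (heap.size() == n) heap.remove(); // evict the current minimum
                heap.add(item);
            }
        }

        public static void main(String[] args) throws Exception {
            final int n = 100, pageCount = 64;
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Each task computes a top-n heap over its own page
            List<Future<Queue<Integer>>> partials = new ArrayList<>();
            for (int page = 0; page < pageCount; page++) {
                final int p = page;
                partials.add(pool.submit(() -> {
                    Queue<Integer> top = new PriorityQueue<>(n);
                    for (int item : fetchPage(p)) offer(top, n, item);
                    return top;
                }));
            }

            // Merging is cheap: only pageCount * n items survive to this step
            Queue<Integer> merged = new PriorityQueue<>(n);
            for (Future<Queue<Integer>> f : partials)
                for (int item : f.get()) offer(merged, n, item);

            pool.shutdown();
            System.out.println(merged);
        }
    }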

It's unnecessary to insert each value, let the collection sort itself, and then remove the smallest. It is quicker to just check whether each item is larger than the smallest item retained so far (i.e. the head of a min-oriented priority queue).

Here's a simple example finding the 100 largest integers in an array of 10,000 random integers.

    Queue<Integer> largest = new PriorityQueue<>(100); // min-heap of the current top 100

    for (int item : new Random().ints(10000, 0, 100).toArray()) {
        if (largest.size() < 100 || largest.peek() < item) {
            if (largest.size() == 100)
                largest.remove(); // evict the smallest to make room
            largest.add(item);
        }
    }
    System.out.println(largest);
sprinter
  • Why this condition: largest.get(0) < 0 ? If the size of largest(N) converges to size of items, this could run in quadratic time, if it would even work. The condition will stop processing on first insert. – John Mar 14 '17 at 15:25
  • Sorry, it should have been `< item`. In other words, there's no need to insert the item if it's already smaller than the smallest in the list. – sprinter Mar 14 '17 at 22:15
  • Correct me if I'm wrong, but I think the worst case (e.g. an ascending sequence of numbers) running time of your algorithm is `O(n)` for inserts, while a heap provides `O(log n)`. – csirmazbendeguz Mar 14 '17 at 22:29
  • @csirmazbendeguz yes, you're correct, a heap would be faster. But the OP is talking about inserting 100 items into the collection while checking tens of millions of items against the smallest item. The speed of insertion will be insignificant against the cost of iterating through the items and performing the checks against the minimum. – sprinter Mar 15 '17 at 02:37
  • @csirmazbendeguz I switched to a priority queue - I doubt it'd make much difference to performance but it makes the code clearer. – sprinter Mar 15 '17 at 02:45

I'd stick with the PriorityQueue and just remove items once the queue grows larger than necessary.

public static void main(String[] args) {
    List<Integer> list = Arrays.asList(1, 10, 2, 9, 3, 7, 4, 6, 5, 7, 7, 7);

    findNHighest(list, 3);
    findNHighest(list, 1);
    findNHighest(list, 4);

}

private static void findNHighest(List<Integer> list, int n) {
    Queue<Integer> nthHighest = new PriorityQueue<>();

    for (Integer each : list) {
        nthHighest.add(each);
        if (nthHighest.size() > n) {
            nthHighest.poll(); // drop the smallest once the queue holds more than n
        }
    }
    System.out.println(nthHighest);
}

Output

[7, 9, 10]
[10]
[7, 7, 9, 10]
Adam

A SortedSet implementation can be used to do the job:

class PQsort implements Comparator<Item> {

  public int compare(Item one, Item two) {
    // Ascending by value; Integer.compare avoids int overflow
    return Integer.compare(one.getValue(), two.getValue());
  }
}

...

Comparator<Item> itemComparator = new PQsort();
// TreeSet has no capacity argument; the size is capped manually below.
// Note: a TreeSet discards items whose values compare as equal.
SortedSet<Item> top100 = new TreeSet<Item>(itemComparator);

// Page through big data
List<Item> items = getNextPage(pageSize);
while (items.size() > 0) {

  for (Item current : items) {
    if (top100.size() < 100 || itemComparator.compare(top100.first(), current) < 0) {
      top100.add(current);             // current beats the smallest of the top 100
      if (top100.size() > 100) {
        top100.remove(top100.first()); // remove the 101st (smallest) element
      }
    }
  }

  // Move to next page
  items = getNextPage(pageSize);
}

As sprinter says in his answer, it could also be reworked into a parallel implementation, e.g. using Streams.
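A minimal sketch of that Streams variant; the bounded-heap `offer` helper and the random data setup are illustrative assumptions, not part of the answer above:

import java.util.*;
import java.util.stream.Collectors;

class StreamsTopN {

    // Offer a value to a bounded min-heap holding the 100 largest seen so far
    static void offer(PriorityQueue<Integer> pq, int v) {
        if (pq.size() < 100 || pq.peek() < v) {
            if (pq.size() == 100) pq.remove(); // evict the current minimum
            pq.add(v);
        }
    }

    public static void main(String[] args) {
        List<Integer> values = new Random().ints(1_000_000, 0, 1_000_000)
                .boxed().collect(Collectors.toList());

        // Each worker thread fills its own heap; heaps are then merged pairwise
        PriorityQueue<Integer> top = values.parallelStream().collect(
                () -> new PriorityQueue<Integer>(100),
                StreamsTopN::offer,
                (left, right) -> right.forEach(v -> offer(left, v)));

        System.out.println(top);
    }
}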

david a.

Building a priority queue over all the items takes O(M log M), where M is the total number of items, and popping N items then takes an additional O(N log M). That is a bit more expensive than sorting the array in O(M log M) and selecting the last N items in O(N).

If N is small, simply iterate over the array N times, each time taking the next largest remaining element (a sketch follows).
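A quick sketch of that repeated-max idea, O(N·M) overall, so only sensible for small N; the names are illustrative:

// Returns the n largest values of a, largest first, by scanning the array n times
static int[] topNByRepeatedMax(int[] a, int n) {
    int[] result = new int[n];
    boolean[] taken = new boolean[a.length];
    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int j = 0; j < a.length; j++) {
            if (!taken[j] && (best == -1 || a[j] > a[best])) {
                best = j; // index of the largest not-yet-taken value
            }
        }
        taken[best] = true;
        result[i] = a[best];
    }
    return result;
}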

The standard solution would be Quickselect, which runs in average linear time; here is the implementation by Professor Robert Sedgewick. If you need the 100 largest items, select the 100th-largest element (k = a.length - 100); all the elements to the right of it will then be greater than or equal to it. The professor also has a nice video lecture on the topic.

Relevant part:

/***************************************************************************
*  Rearranges the elements in a so that a[k] is the kth smallest element,
*  and a[0] through a[k-1] are less than or equal to a[k], and
*  a[k+1] through a[n-1] are greater than or equal to a[k].
***************************************************************************/
public static <Key extends Comparable<Key>> Key select(Key[] a, int k) {
    if (k < 0 || k >= a.length) {
        throw new IndexOutOfBoundsException("Selected element out of bounds");
    }
    StdRandom.shuffle(a);
    int lo = 0, hi = a.length - 1;
    while (hi > lo) {
        int i = partition(a, lo, hi);
        if      (i > k) hi = i - 1;
        else if (i < k) lo = i + 1;
        else return a[i];
    }
    return a[lo];
}
John
  • OP mentions hundreds of millions of items. To insert them all into a priority queue when only a small number are required (example given is N = 100) is pretty wasteful. – sprinter Mar 14 '17 at 22:21
  • I agree with you, although I never really mentioned that the OP should insert the items into a priority queue. – John Mar 14 '17 at 23:33

I would replace your largest with a List<Item>. In your loop you could then do something like:

largest.add(current);
bubbleSort(largest);        // only the newly added element is out of place
if (largest.size() > 100) {
  largest.remove(0);        // drop the smallest, keeping the 100 largest
}

By using a bubble sort you can keep the per-item cost at O(n), where n is the size of the list (here 100, not the total number of items), because one of the features of bubble sort is that if only one entry is out of place it completes in a single O(n) pass.

I leave it to the student to implement bubbleSort.
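For reference, a minimal sketch of the single pass that the O(n) claim relies on; it only handles the append-one case used above, sifting the newly added last element left until it is in place:

// Assumes the list was sorted ascending before the new element was appended
static void bubbleSort(List<Item> largest) {
    for (int i = largest.size() - 1;
         i > 0 && largest.get(i).getValue() < largest.get(i - 1).getValue();
         i--) {
        Collections.swap(largest, i, i - 1);
    }
}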

OldCurmudgeon