I need to find N largest elements in a big collection of data.
I have:
- A collection of hundreds of millions items in external database (Cassandra)
A job that iterates through these items and finds item with largest value
Item largest = null; // Page through big data List<Item> items = getNextPage(pageSize); while (items.size() > 0) { // Update largest item based on values from current page for (Item current : items) { if (largest == null || largest.getValue() < current.getValue()) { largest = current; } } // Move to next page items = getNextPage(pageSize); }
I need:
- Extend this job to hold N (lets say 100) elements with highest value
My approach:
I was thinking about something like priority queue with fixed size
class PQsort implements Comparator<Item> { public int compare(Item one, Item two) { return two.getValue() - one.getValue(); } } PriorityQueue<Item> pq = new PriorityQueue<Item>(101, new PQsort()); ...while...for... pq.offer(current); if (pq.size() == 101) { // Remove the tail somehow } ...
Removing the tail: Removing tail element of priority queue
What is the optimal solution for this task?