"Stable" k-largest elements algorithm

Question

Related: priority queue with limited space: looking for a good algorithm

I am looking for an algorithm which returns the k-largest elements from a list, but does not change the order of the k-largest elements, e.g. for k=4 and given 5,9,1,3,7,2,8,4,6, the algorithm should return 9,7,8,6.

Some more background: my input data are approximately 200 (distance, importance) pairs, ordered by distance, and I need to select the 32 most important of them. Performance is crucial here, since I have to run this selection algorithm a few thousand times.

Up to now I have the two following ideas, but both of them seem not to be the best possible.

Remove the minimum element iteratively until 32 elements are left (i.e. do selection sort)
Use quickselect or median-of-medians to search for the 32nd largest element. Afterwards, sort the 32 selected elements again w.r.t. distance.
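
For reference, the second idea could be sketched roughly as follows with std::nth_element, the standard library's selection routine (linear time on average). This is only a sketch under assumptions: the names Pair and stableTopK are made up for illustration, the selection works on indices so that the input itself is never reordered, and ties in importance are broken in favour of the earlier element, which the problem statement does not prescribe.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical type for one (distance, importance) entry.
struct Pair {
    int distance;
    double importance;
};

// Returns the k most important entries of `in`, in their original order.
std::vector<Pair> stableTopK(const std::vector<Pair>& in, std::size_t k) {
    if (k >= in.size()) return in;  // nothing to discard

    // Select on indices so that the input is left untouched.
    std::vector<std::size_t> idx(in.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // After nth_element, the first k indices refer to the k largest
    // importances. Assumption: ties go to the lower index.
    const auto kth = idx.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(idx.begin(), kth, idx.end(),
        [&](std::size_t a, std::size_t b) {
            if (in[a].importance != in[b].importance)
                return in[a].importance > in[b].importance;
            return a < b;
        });

    // Sorting the k selected indices restores the original order.
    std::sort(idx.begin(), kth);

    std::vector<Pair> out;
    out.reserve(k);
    for (auto it = idx.begin(); it != kth; ++it) out.push_back(in[*it]);
    return out;
}
```

With importances 5,9,1,3,7,2,8,4,6 and k=4 this returns the entries with importances 9,7,8,6. The final sort only touches k indices, so for 200 inputs and k=32 the selection step dominates.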

I need to implement this in C++, so if anybody wants to write some code and does not know which language to use, C++ would be an option.

Why not use the standard heap/priority queue solution, but also keep track of which index each element came from, and then sort the result of that by index? — kaya3, Oct 06 '20 at 16:05
Is there a fixed range for the values? If so, and it is small, counting sort is an option. – Cherubim, Oct 06 '20 at 16:26
Note that sorting the remaining 32 elements would be faster than the first step of selecting these 32 elements — Damien, Oct 06 '20 at 16:38
@Cherubim, the `distance`s are integers between 1 and 300000 (it may be possible that I have to use floating point for the distances, but I don't think it will become necessary). The `importance`s are floating point values between -1 and 200. Thus, counting sort is not an option (nevertheless, I didn't know that sorting algorithm yet, thanks for pointing me to it). – tommsch, Oct 06 '20 at 16:44

Deepak Tatyaji Ahire · Answer 1 · 2020-10-06T19:51:04.023

Inspired by @trincot's solution, I have come up with a slightly different variation with a working implementation.

Algorithm

1. Build a max heap with Floyd's algorithm. In C++ this corresponds to constructing the priority_queue from the entire array/vector at once (via the range constructor) instead of adding the elements individually. The max heap is built in O(N) time.
2. Pop items from the max heap K-1 times, so that the Kth Max Importance Item ends up on top. Store it in the variable Kth_Max_Importance_Item.
3. Count the items in the original input whose importance value is greater than that of Kth_Max_Importance_Item.
4. Subtract that count from K to get the number of items with importance value equal to that of Kth_Max_Importance_Item that are still required. Store it in the variable left_Over_Count.
5. Scan the original input once, in order, and push into the output vector every item whose importance value is greater than that of Kth_Max_Importance_Item, plus the first left_Over_Count items whose importance value is equal to it. Because the items are pushed in input order, the output keeps the original order.

NOTE: If importance values are not unique, the ties are taken care of by steps 4 and 5 of the algorithm.

Time Complexity: O(N + K*log(N)). Assuming K<<N, then, Time Complexity ~ O(N).

Implementation:

#include <cmath>
#include <iostream>
#include <queue>
#include <vector>

//One input entry: a (distance, importance) pair
struct Item{

    int distance;
    double importance;

};

//Orders items by importance, so that std::priority_queue behaves as a max heap
struct itemsCompare{

    bool operator() (const Item& item1, const Item& item2) const{

        return item1.importance < item2.importance;
    }
};

//Returns true if a and b are equal within a small absolute tolerance
bool compareDouble(const double& a, const double& b){

    return std::fabs(a-b) < 0.000001;
}

int main(){

    //Original input
    std::vector<Item> items{{10, 2.1}, {9, 2.3}, {8, 2.2}, {7, 2.2}, {6, 1.5}};

    int k = 4;

    //Max heap ordered by importance, built in O(N) from the whole range
    std::priority_queue<Item, std::vector<Item>, itemsCompare> maxHeap (items.begin(), items.end());

    //Checking if the order of original input is intact
    /*for(int i=0;i<items.size();i++){
        std::cout<<items[i].distance<<" "<<items[i].importance<<std::endl;
    }*/

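    //Popping k-1 items so that the Kth Max Importance Item ends up on top of the heap

    int count = 0;
    while(count < k-1 && !maxHeap.empty()){

        maxHeap.pop();
        count++;

    }

    //If k is out of range, top() must not be called on a possibly empty heap
    if(k < 1 || maxHeap.empty()){
        std::cerr<<"k must be between 1 and the number of items"<<std::endl;
        return 1;
    }

    Item Kth_Max_Importance_Item = maxHeap.top();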

    //std::cout<<Kth_Max_Importance_Item.importance<<std::endl;


    //Counting the items whose importance value is strictly greater than that of Kth_Max_Importance_Item

    int greater_Count = 0;

    for(const Item& item : items){

        if(item.importance > Kth_Max_Importance_Item.importance){
            greater_Count++;
        }
    }

    //Number of items tied with Kth_Max_Importance_Item that are needed to reach k items in total

    int left_Over_Count = k - greater_Count;

    //Single scan in input order, so that the selected items keep their original relative order
    //(appending the tied items in a separate second scan would move them to the end of the output)

    std::vector<Item> output;

    for(const Item& item : items){

        if(item.importance > Kth_Max_Importance_Item.importance){
            output.push_back(item);
        }
        else if(left_Over_Count > 0 && compareDouble(item.importance, Kth_Max_Importance_Item.importance)){
            output.push_back(item);
            left_Over_Count--;
        }
    }

    //Printing the output:

    for(int i=0;i<output.size();i++){

        std::cout<<output[i].distance<<" "<<output[i].importance<<std::endl;
    }

    return 0;
}

Output:

I have to accept trincot's answer, since he came up with the solution, but I am grateful to you as well. Thanks. – tommsch, Oct 07 '20 at 08:59

score 3 · Accepted Answer · answered Oct 06 '20 at 17:05


Use the heap-based algorithm for finding the k-th largest value, i.e. use a min heap (not a max heap) that never exceeds a size of k. Push the values one by one; whenever the heap exceeds that size, pull the root from it to restore it to a size of k.

At the end the heap's root will be the k-th largest value. Let's call it m.

You could then scan the original input again to collect all values that are at least equal to m. This way you'll have them in their original order.

When that m is not unique, you could have collected too many values. So check the size of the result and determine how much longer it is than k. Go backwards through that list and mark the ones that have value m as deleted until you have reached the right size. Finally collect the non-deleted items.

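All these scans are O(n). The most expensive step is the first one: O(n log k), since each of the n values costs at most one push and one pop on a heap that holds at most k+1 values.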
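
A minimal sketch of the steps above, assuming plain double values instead of (distance, importance) pairs. The function name stableKLargest is made up for illustration, and the "mark as deleted" step is done here by skipping the surplus copies of m while walking backwards.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the k largest values of `in`, kept in their original order.
std::vector<double> stableKLargest(const std::vector<double>& in, std::size_t k) {
    if (k == 0) return {};
    if (k >= in.size()) return in;

    // Min heap that is trimmed back to size k after every push.
    std::priority_queue<double, std::vector<double>, std::greater<double>> heap;
    for (double v : in) {
        heap.push(v);
        if (heap.size() > k) heap.pop();  // drop the current minimum
    }
    const double m = heap.top();  // the k-th largest value

    // Second scan: collect every value that is at least m, in input order.
    std::vector<double> collected;
    for (double v : in) {
        if (v >= m) collected.push_back(v);
    }

    // If m is not unique, too many values were collected. Walk backwards
    // and skip copies of m until exactly k values remain.
    std::size_t surplus = collected.size() - k;
    std::vector<double> result;
    result.reserve(k);
    for (auto it = collected.rbegin(); it != collected.rend(); ++it) {
        if (surplus > 0 && *it == m) {
            --surplus;
            continue;
        }
        result.push_back(*it);
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

For the input 5,9,1,3,7,2,8,4,6 with k=4 this returns 9,7,8,6. The heap lives in separate memory and is discarded once m is known, so the input order is never disturbed.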

answered Oct 06 '20 at 17:05

trincot


won't making use of min-heap change the order of the original input? – Deepak Tatyaji Ahire Oct 06 '20 at 17:37
Not if you use separate memory of the min-heap. Note that this algorithm does not actually rely on the content of the heap once the k-th largest element has been identified. At that moment the heap can be discarded. – trincot Oct 06 '20 at 17:39
Excellent solution if space complexity does not matter @trincot – Deepak Tatyaji Ahire Oct 06 '20 at 17:52
I understand the idea, but don't quite get the use of the min-heap. If the last element I add to the min-heap were the lowest number in my list, wouldn't this element then be on top of the min-heap, instead of the k-th largest value? – tommsch Oct 06 '20 at 19:43
@tommsch, the catch is, not to push the elements inside the heap if the current top element of the heap is greater than the element that you will be pushing. This way, the top element will be the Kth Max Element. – Deepak Tatyaji Ahire Oct 06 '20 at 19:45
@DeepakTatyajiAhire Thanks, trincot's solution is really a great idea. I just found https://stackoverflow.com/questions/2933758/priority-queue-with-limited-space-looking-for-a-good-algorithm which boils down to the very same problem and which gives the same answer. – tommsch Oct 06 '20 at 19:51