20

In SQL there is the feature to say something like

SELECT TOP 20 distance FROM dbFile ORDER BY distance ASC

If my SQL is correct with, say 10,000 records, this should return the 20 smallest distances in my databse.

I don't have a database. I have a 100,000-element simple array.

Is there a C++ container, Boost, MFC or STL that provides simple code for a struct like

struct closest{
    int    ID;
    double distance;
    closest():ID(-1), distance(std::numeric_limits<double>::max( )){}
};

Where I can build a sorted by distance container like

boost::container::XXXX<closest> top(20);

And then have a simple

top.replace_if(closest(ID,Distance));

Where the container will replace the entry with the current highest distance in my container with my new entry if it is less than the current highest distance in my container.

I am not worried about speed. I like elegant clean solutions where containers and code do all the heavy lifting.

EDIT. Addendum after all the great answers received.

What I really would of liked to have found, due to its elegance. Is a sorted container that I could create with a container size limit. In my case 20. Then I could push or insert to my hearts content a 100 000 items or more. But. There is always a but. The container would of maintained the max size of 20 by replacing or not inserting an item if its comparator value was not within the lowest 20 values.

Yes. I know now from all these answers that via programming and tweaking existing containers the same effect can be achieved. Perhaps when the next round of suggestions for the C & C++ standards committee sits. We could suggest. Self sorting (which we kind of have already) and self size limiting containers.

kingchris
  • 1,677
  • 22
  • 29
  • 3
    like `std::priority_queue` ? – Mohammad Dec 20 '15 at 07:12
  • @Mohammad May I ask how a priority queue would help. I will google it for now. Thanks – kingchris Dec 20 '15 at 07:15
  • @Mohammad Any queue adaptor will only give you the first element. If you need more than one, you need a different data structure. – juanchopanza Dec 20 '15 at 07:23
  • @kingchris How do you define "distance"? – juanchopanza Dec 20 '15 at 07:24
  • @juanchopanza I was going for poping top element n times. – Mohammad Dec 20 '15 at 07:26
  • @juanchopanza. Its a Vincenty calculated value of latitude and longitude points in relation to each other. – kingchris Dec 20 '15 at 07:27
  • @kingchris, I have done the priority queue method exact one in python to find the largest and oldest files in filesystem. I left you the approach – ForeverStudent Dec 20 '15 at 07:31
  • 1
    I'm not sure, If I completely understand your question: Do you already have all elements and you want to find the 20 with the smallest distance, or are they coming in one by one and you want to hold only the 20 smallest and discard the rest? In the first case, you could just (partially) sort your array and the whole thing boils down to a one-liner. – MikeMB Dec 20 '15 at 09:18
  • Oh, and should the result be ordered? – MikeMB Dec 20 '15 at 11:48
  • 3
    @MikeMB I actually have 100 000 Lat & Lon points drawn on a opengl sphere. I want to work out the 20 nearest points to each of the 100 000 points. So we have two loops to pick each point then calculate that point against every other point and save the closest 20 points. – kingchris Dec 20 '15 at 12:08
  • in C++ you can easily implement the "automatic size checking" by creating a wrapper class that implements my solution and abstracts all the nitty gritty while giving you the "pushPopSizeCheck()" – ForeverStudent Dec 29 '15 at 18:37
  • @ForeverStudent Could you point somewhere that this may have been done. Easily implement for someone who is not familiar with the inner workings of the std lib could evolve into several hours of work. Often I wish to just use the code without having to override and rewriting it unless I have to. – kingchris Dec 30 '15 at 04:20
  • @kingchris There is no reason to override and rewrite code from standard namespace. All you need to do is to create a wrapper function that keeps the information internally and abstracts away the implementation details. I am going to update my answer and give you a toy example. – ForeverStudent Dec 30 '15 at 17:33
  • @ForeverStudent Thanks that would be helpful. – kingchris Dec 31 '15 at 04:13
  • @kingchris look at my answer, it has C++ code that gives you the idea of what to do. as you can see all the detail is abstracted in the smallestElements class. performance overhead is low and everything is encapsulated. – ForeverStudent Dec 31 '15 at 04:16

9 Answers9

22

What you need is to have a maxheap of size 20. Recall that the root of your heap will be the largest value in the heap.

This heap will contain the records with smallest distance that you have encountered so far. For the first 20 out of 10000 values you just push to the heap.

At this point you iterate through the rest of the records and for each record, you compare it to the root of your heap.

Remember that the root of your heap is basically the very worst of the very best.(The record with the largest distance, among the 20 records with the shortest distance you have encountered so far).

If the value you are considering is not worth keeping (its distance is larger that the root of your tree), ignore that record and just keep moving.

Otherwise you pop your heap (get rid of the root) and push the new value in. The priority queue will automatically put its record with the largest distance on the root again.

Once you keep doing this over the entire set of 10000 values, you will be left with the 20 records that have the smallest distance, which is what you want.

Each push-pop takes constant O(1) time, iterating through all inputs of N is O(n) so this is a Linear solution.

Edit: I thought it would be useful to show my idea in C++ code. This is a toy example, you can write a generic version with templates but I chose to keep it simple and minimalistic:

#include <iostream>
#include <queue>
using namespace std;
class smallestElements
{
private:
    priority_queue<int,std::vector<int>,std::less<int> > pq;
    int maxSize;
public:
    smallestElements(int size): maxSize(size)
    {
    pq=priority_queue<int, std::vector<int>, std::less<int> >();
    }
    void possiblyAdd(int newValue)
    {
        if(pq.size()<maxSize)
        {
            pq.push(newValue);
            return;
        }
        if(newValue < pq.top())
        {
            pq.pop(); //get rid of the root
            pq.push(newValue); //priority queue will automatically restructure
        }
    }
    void printAllValues()
    {
        priority_queue<int,std::vector<int>,std::less<int> > cp=pq;
        while(cp.size()!=0)
        {
            cout<<cp.top()<<" ";
            cp.pop();
        }
        cout<<endl;
    }
};

How you use this is really straight forward. basically in your main function somewhere you will have:

smallestElements se(20); //we want 20 smallest
//...get your stream of values from wherever you want, call the int x
se.possiblyAdd(x); //no need for bounds checking or anything fancy
//...keep looping or potentially adding until the end

se.printAllValues();//shows all the values in your container of smallest values
// alternatively you can write a function to return all values if you want
ForeverStudent
  • 2,487
  • 1
  • 14
  • 33
  • Ok. This looks promising, Can C++ heaps be size limited? – kingchris Dec 20 '15 at 07:34
  • look in documentation for std::priority_queue, if the size is not automatically limited you can always check the size property yourself – ForeverStudent Dec 20 '15 at 07:35
  • O(log 20) = O(1), it is a linear solution !? – Siphor Dec 20 '15 at 10:35
  • 2
    yes but with the O(n) iterating it is not close to a linear solution, but it is a linear solution. – Siphor Dec 20 '15 at 12:03
  • wether it is O(nlogn) or O(n) depends on if the value 20 is constant or a function of 10000 – ForeverStudent Dec 20 '15 at 14:10
  • This is a 100% linear solution. 20 is a constant. Also you could do it the other way round. Build a min heap with all the values, which is again linear, and pop the minimum 20 times, which is again constant time. Our use the algorithm for order statistic to find the 20th element and scab the array to find all those smaller than it. This is linear even if 20 isn't a constant if you use the right algorithm – Bakuriu Dec 20 '15 at 15:50
  • [`std::make_heap`](http://en.cppreference.com/w/cpp/algorithm/make_heap) and the other two. Also, it isn't close to a linear solution, it is one. – Deduplicator Dec 20 '15 at 21:35
  • 1
    @ForeverStudent Excellent. You have nailed it. Thanks for all the extra effort. I now grok it. – kingchris Dec 31 '15 at 14:44
  • @kingchris glad to have helped – ForeverStudent Jan 01 '16 at 21:39
11

If this is about filtering the 20 smallest elements from a stream on the fly, then a solution based on std::priority_queue (or std::multiset) is the way to go.

However, if it is about finding the 20 smallest elements in a given arraym I wouldn't go for a special container at all, but simply the algorithm std::nth_element - a partial sorting algorithm that will give you the n smallest elements - EDIT: or std::partial_sort (thanks Jarod42) if the elements also have to be sorted. It has linear complexity and it's just a single line to write (+ the comparison operator, which you need in any case):

#include <vector>
#include <iostream>
#include <algorithm>

struct Entry {
    int ID;
    double distance;    
};

std::vector<Entry> data;    

int main() {
    //fill data;

    std::nth_element(data.begin(), data.begin() + 19, data.end(), 
        [](auto& l, auto& r) {return l.distance < r.distance; });

    std::cout << "20 elements with smallest distance: \n"; 
    for (size_t i = 0; i < 20; ++i) {
        std::cout << data[i].ID << ":" << data[i].distance << "\n";
    }
    std::cout.flush();
}

If you don't want to change the order of your original array, you would have to make a copy of the whole array first though.

MikeMB
  • 20,029
  • 9
  • 57
  • 102
7

My first idea would be using a std::map or std::set with a custom comparator for this (edit: or even better, a std::priority_queue as mentioned in the comments).

Your comparator does your sorting.

You essentially add all your elements to it. After an element has been added, check whether there are more than n elements inside. If there are, remove the last one.

Mario
  • 35,726
  • 5
  • 62
  • 78
  • This might work. Add elements until size = 21 then delete highest distance entry. Add another element, remove highest. Rinse, repeat. – kingchris Dec 20 '15 at 07:24
  • 1
    This is overkill. You don't need to sort *every* element to get the top 20. You also don't need a node-based data structure. – juanchopanza Dec 20 '15 at 07:41
  • @juanchopanza care sharing your solution? You will have to compare all your elements unless they're sorted already. – Mario Dec 20 '15 at 07:48
  • @Mario You need to compare all of them, but you don't need for all of them to be sorted. A heap would probably be a good solution, but it depends on what OP really needs, which isn't completely clear from the question. – juanchopanza Dec 20 '15 at 07:51
  • BTW, I am talking about `set` and `map`. `priority_queue` would be OK (uses a heap internally) , except that one doesn't have access to the top 20 elements. – juanchopanza Dec 20 '15 at 07:52
6

I am not 100% sure, that there is no more elegant solution, but even std::set is pretty pretty.

All you have to do is to define a proper comparator for your elements (e.g. a > operator) and then do the following:

std::set<closest> tops(arr, arr+20)
tops.insert(another);
tops.erase(tops.begin());
Marandil
  • 1,042
  • 1
  • 15
  • 31
  • A set doesn't allow multiple elements with the same distance. I'd recommend multiset instead – MikeMB Dec 20 '15 at 10:49
  • @Marandil The priority queue worked ok. Even though it was s difficult beast to then get the data out of. – kingchris Dec 21 '15 at 05:40
5

I would use nth_element like @juanchopanza suggested before he deleted it.

His code looked like:

bool comp(const closest& lhs, const closest& rhs)
{
    return lhs.distance < rhs.distance;
}

then

std::vector<closest> v = ....;
nth_element(v.begin(), v.begin() + 20, v.end(), comp);

Though if it was only ever going to be twenty elements then I would use a std::array.

graham.reeds
  • 16,230
  • 17
  • 74
  • 137
  • `std::partial_sort` instead of `nth_element` if you want the resulted elements sorted. – Jarod42 Dec 20 '15 at 10:11
  • Thanks. Currently it is 20 elements. But this might change in the near future. But this will work. – kingchris Dec 21 '15 at 05:38
  • @Jarod42: It would be quicker to use `nth_element` (linear complexity) and sort those than to use `partial_sort` (logarithmic complexity). – graham.reeds Dec 21 '15 at 10:43
  • @graham.reeds: They have same complexity: `O(N)+O(n20 log n20)` vs `O(N log n20)`. – Jarod42 Dec 21 '15 at 11:43
  • Moreover, I would expect than a **dedicated** method is not worse than this trivial implementation in 2 steps. – Jarod42 Dec 21 '15 at 11:46
4

Just so you can all see what I am currently doing which seems to work.

struct closest{
    int base_ID;
    int ID;
    double distance;

    closest(int BaseID, int Point_ID, 
    double Point_distance):base_ID(BaseID), 
    ID(Point_ID),distance(Point_distance){}
    closest():base_ID(-1), ID(-1),
    distance(std::numeric_limits<double>::max( )){}

    bool operator<(const closest& rhs) const
    {
        return distance < rhs.distance;
    }
};



void calc_nearest(void) 
{ 
    boost::heap::priority_queue<closest> svec;

    for (int current_gift = 0; current_gift < g_nVerticesPopulated; ++current_gift)
    {   double best_distance=std::numeric_limits<double>::max();    
        double our_distance=0.0;
        svec.clear();

        for (int all_other_gifts = 0; all_other_gifts < g_nVerticesPopulated;++all_other_gifts)
        {   
            our_distance = distanceVincenty(g_pVertices[current_gift].lat,g_pVertices[current_gift].lon,g_pVertices[all_other_gifts].lat,g_pVertices[all_other_gifts].lon);
            if (our_distance != 0.0)
            {
                if (our_distance < best_distance) // don't bother to push and sort if the calculated distance is greater than current 20th value
                    svec.push(closest(g_pVertices[current_gift].ID,g_pVertices[current_gift].ID,our_distance));

                if (all_other_gifts%100 == 0)
                {
                    while (svec.size() > no_of_closest_points_to_calculate) svec.pop(); // throw away any points above no_of_closest_points_to_calculate
                    closest t =  svec.top(); // the furthest of the no_of_closest_points_to_calculate points for optimisation
                    best_distance = t.distance;
                }
            }
        }
        std::cout << current_gift << "\n";
    }
}

As you can see. I have 100 000 lat & long points draw on an openGl sphere. I am calculating each point against every other point and only retaining currently the closest 20 points. There is some primitive optimisation going on by not pushing a value if it is bigger than the 20th closest point.

As I am used to Prolog taking hours to solve something I am not worried about speed. I shall run this overnight.

Thanks to all for your help. It is much appreciated. Still have to audit the code and results but happy that I am moving in the right direction.

kingchris
  • 1,677
  • 22
  • 29
  • May I ask what this is for? – graham.reeds Dec 21 '15 at 10:45
  • Its for the Kaggle Santa Competition which is a traveling salesman problem with the constraint that his sleigh can only carry 1000 kgs per trip. For Prolog to solve this I need a list per destination of nearest destinations to choose the next delivery address. – kingchris Dec 21 '15 at 12:27
  • @graham.reeds. Sorry. Forgot to reference your name when replying to your question. See comment currently above. – kingchris Dec 21 '15 at 13:01
4

I have posted a number of approaches to the similar problem of retrieving the top 5 minimum values recently here:

There are implementations that keep a specific number of smallest or greatest items from an input vector in different ways. The nth_element algorithm performs a partial sort, the priority queue maintains a heap, the set a binary search tree, and the deque- and vector-based approaches just remove an element based on a (linear) min/max search.

It should be fairly easy to implement a custom comparison operator and to adapt the number of items to keep n.

Here's the code (refactored based off the other post):

#include <algorithm>
#include <functional>
#include <queue>
#include <set>
#include <vector>
#include <random>
#include <iostream>
#include <chrono>

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_nth_element(std::vector<T> v, typename std::vector<T>::size_type n) {
    auto target = v.begin()+n;
    std::nth_element(v.begin(), target, v.end(), Compare());
    std::vector<T> result(v.begin(), target);
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_pqueue(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> result;
    std::priority_queue<T, std::vector<T>, Compare> q;
    for (auto i: v) {
        q.push(i);
        if (q.size() > n) {
            q.pop();
        }
    }
    while (!q.empty()) {
        result.push_back(q.top());
        q.pop();
    }
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_set(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::set<T, Compare> s;
    for (auto i: v) {
        s.insert(i);
        if (s.size() > n) {
            s.erase(std::prev(s.end()));
        }
    }
    return std::vector<T>(s.begin(), s.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_deque(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::deque<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return std::vector<T>(q.begin(), q.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_vector(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return q;
}

template <typename Clock = std::chrono::high_resolution_clock>
struct stopclock {
    std::chrono::time_point<Clock> start;
    stopclock() : start(Clock::now()) {}
    ~stopclock() {
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
        std::cout << "elapsed: " << elapsed.count() << "ms\n";
    }
};

std::vector<int> random_data(std::vector<int>::size_type n) {
    std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist;
    std::vector<int> out(n);
    for (auto &i: out)
        i = dist(gen);
    return out;
}

int main() {
    std::vector<int> data = random_data(1000000);
    stopclock<> sc;
    std::vector<int> result = filter_nth_element(data, 5);
    std::cout << "smallest values: ";
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
    std::cout << "largest values: ";
    result = filter_nth_element<int, std::greater<int>>(data, 5);
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
}

Example output is:

$ g++ test.cc -std=c++11 && ./a.out
smallest values: 4433 2793 2444 4542 5557 
largest values: 2147474453 2147475243 2147477379 2147469788 2147468894 
elapsed: 123ms

Note that in this case only the position of the nth element is accurate with respect to the order imposed by the provided comparison operator. The other elements are guaranteed to be smaller/greater or equal to that one however, depending on the comparison operator provided. That is, the top n min/max elements are returned, but they are not correctly sorted.

Don't expect the other algorithms to produce results in a specific order either. (While the approaches using priority queue and set actually produce sorted output, their results have the opposite order).

For reference:

Community
  • 1
  • 1
moooeeeep
  • 31,622
  • 22
  • 98
  • 187
4

I actually have 100 000 Lat & Lon points drawn on a opengl sphere. I want to work out the 20 nearest points to each of the 100 000 points. So we have two loops to pick each point then calculate that point against every other point and save the closest 20 points.

This reads as if you want to perform a k-nearest neighbor search in the first place. For this, you usually use specialized data structures (e.g., a binary search tree) to speed up the queries (especially when you are doing 100k of them).

For spherical coordinates you'd have to do a conversion to a cartesian space to fix the coordinate wrap-around. Then you'd use an Octree or kD-Tree.

Here's an approach using the Fast Library for Approximate Nearest Neighbors (FLANN):

#include <vector>
#include <random>
#include <iostream>
#include <flann/flann.hpp>
#include <cmath>

struct Point3d {
    float x, y, z;
    void setLatLon(float lat_deg, float lon_deg) {
        static const float r = 6371.; // sphere radius
        float lat(lat_deg*M_PI/180.), lon(lon_deg*M_PI/180.);
        x = r * std::cos(lat) * std::cos(lon);
        y = r * std::cos(lat) * std::sin(lon);
        z = r * std::sin(lat);
    }
};

std::vector<Point3d> random_data(std::vector<Point3d>::size_type n) {
    static std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist(0, 36000);
    std::vector<Point3d> out(n);
    for (auto &i: out)
        i.setLatLon(dist(gen)/100., dist(gen)/100.);
    return out;
}

int main() {
    // generate random spherical point cloud
    std::vector<Point3d> data = random_data(1000);
    // generate query point(s) on sphere
    std::vector<Point3d> query = random_data(1);

    // convert into library datastructures
    auto mat_data = flann::Matrix<float>(&data[0].x, data.size(), 3);
    auto mat_query = flann::Matrix<float>(&query[0].x, query.size(), 3);
    // build KD-Tree-based index data structure
    flann::Index<flann::L2<float> > index(mat_data, flann::KDTreeIndexParams(4));
    index.buildIndex();
    // perform query: approximate nearest neighbor search
    int k = 5; // number of neighbors to find
    std::vector<std::vector<int>> k_indices;
    std::vector<std::vector<float>> k_dists;
    index.knnSearch(mat_query, k_indices, k_dists, k, flann::SearchParams(128));

    // k_indices now contains for each query point the indices to the neighbors in the original point cloud
    // k_dists now contains for each query point the distances to those neighbors
    // removed printing of results for brevity
}

You'd receive results similar to this one (click to enlarge):

flann result

For reference:

moooeeeep
  • 31,622
  • 22
  • 98
  • 187
  • Thanks for this. I already use Vincenty Inverse Solution and have found some code for this but the nearest neighbor search info is useful, However there are some quirks with points that might appear to be at -179 and + 179 long as they might be very close on the sphere. – kingchris Dec 22 '15 at 19:19
  • @kingchris That might be a problem with the wrap around of the coordinates... Normally Vincenty should take care of this, shouldn't it? Have you tried Haversine (yes, it's less accurate)? – moooeeeep Dec 22 '15 at 19:32
  • Have tried both. Vincenty is more accurate and takes care of the squashed nature of the planet. – kingchris Dec 23 '15 at 03:56
2

Heap is the data structure that you need. pre-C++11 stl only had functions which managed heap data in your own arrays. Someone mentioned that boost has a heap class, but you don't need to go so far as to use boost if your data is simple integers. stl's heap will do just fine. And, of course, the algorithm is to order the heap so that the highest value is the first one. So with each new value, you push it on the heap, and, once the heap reaches 21 elements in size, you pop the first value from the heap. This way whatever 20 values remain are always the 20 lowest.

Dmitry Rubanovich
  • 2,471
  • 19
  • 27
  • Not sure about this solution. You see I have 100 000 points and I calculate the distance from each to all the others then I want to keep the shortest/closest points for another program. So the priority queue worked out as it sorts the distances for me. Yes if I could get a self sorting heap that would help – kingchris Dec 21 '15 at 05:35
  • @kingchris, priority queues are usually implemented as heaps. So much so that many people think of the two terms as synonymous. You would only need to allocate an array of 21 elements for the purposes of keeping track of the lowest 20 because after the first 20 you'd always follow a push with a pop. The std function is documented here: http://en.cppreference.com/w/cpp/algorithm/push_heap. – Dmitry Rubanovich Dec 21 '15 at 05:46
  • Ok I will look into a push heap,. – kingchris Dec 21 '15 at 05:50