1

I'm teaching myself c++ and eigen in one go, so maybe this is an easy question.

Given n and 0 "<" m "<" n, and an n-vector d of floats. To make it concrete:

VectorXf d = VectorXf::Random(n)

i would like to have a m-vector d_prim onf integers that contains the indexes of all the entries of d that are smaller or equal than the m-th largest entry of d. Efficiency matters. if there are draws in the data, then filling d_prim the first m entries of d that are smaller than its m-th largest entry is fine (i really need the index of m numbers that are not larger than the m^th largest entry of d).

I've tried (naively):

float hst(VectorXf& d,int& m){
//  VectorXf d = VectorXf::Random(n);
    std::nth_element(d.data().begin(),d.data().begin()+m,d.data().end());
    return d(m);
}

but there is two problems with it:

  1. it doesn't work
  2. even if it did work, i still have to pass over (a copy) of d once to find the indices of those entries that are smaller than d(m). Is this necessary?

Best,

user189035
  • 5,589
  • 13
  • 52
  • 112
  • If you're serious about this, you might like to investigate data structures that are optimised for k-th order statistics. You can build one based on a BST, but the standard library doesn't have anything like that. – Kerrek SB Feb 29 '12 at 23:00
  • I've been reading arround a bit (say here: http://www.disnetwork.info/1/post/2009/06/median-value-selection-algorithm.html & the comments). My understanding is that for the size of n's i'm dealing with here [10^3-10^4 floats] it will be hard to come up with an approach that is more than 2 times faster than nth_element. But i cuold be wrong. If you have examples, please post them. – user189035 Feb 29 '12 at 23:55
  • Actually, since you only need to obtain the mth element once, the solution David suggests is probably entirely sufficient. – Kerrek SB Mar 01 '12 at 06:20

2 Answers2

2

std::nth_element is what you want (contrary to what I said before). It does a partial so that the elements in the range [first, mth) are less than those in the range [mth, last). So after running nth_element all you have to do copy the first m elements to the new vector.

VextorXf d = VectorXf::Random(n);
VectorXi d_prim(m);

std::nth_element(d.data().begin(), d.data.begin() + m, d.data().end());
std::copy(d.data().begin(), d.data().begin() + m, d_prim.begin());

This answer has more info on algorithms to do this.

Community
  • 1
  • 1
David Brown
  • 13,336
  • 4
  • 38
  • 55
  • thanks, this is basically the construct a colleague recommended today. But it is O(nlog(n/2)) [and n is in the 10^3-10^4 range] whereas one based on nth_element would be O(n). the thing is that this part of the code has to run many, many times so i prefer it to be O(n). – user189035 Feb 29 '12 at 23:07
  • 1
    `partial_sort_copy` is O(n*log(m)). The one based on `nth_element` would be O(n*m) to construct the whole vector. I'm pretty sure sorting is the fastest way. – David Brown Feb 29 '12 at 23:13
  • As Kerrik SB says though, if you don't need to use a vector there may be a better way to do this with some other data structure. – David Brown Feb 29 '12 at 23:16
  • isn't nth_element O(n)? Then at most, one as to go once more over all entries of a copy of the original d and compare them to the m^th entry of the output of nth_element(). That would be at most another O(n). Where is the O(m*n) coming from? I feel i'm missing something here. Maybe you could add this to your answer for future reference? – user189035 Feb 29 '12 at 23:21
  • 1
    Yes, I incorrectly assumed what `nth_element` does. It should be possible to do it in O(n). – David Brown Feb 29 '12 at 23:49
  • 1
    So, to answer the actual question: a) use `nth_element(begin(), begin() + m, end())` to obtain the value of the mth-smallest element, and then pass over the original vector again to store the indices of all elements that are less than it. You'll have to think about what happens when all elements are equal, etc., I suppose. – Kerrek SB Mar 01 '12 at 06:24
1

Putting together David Brown's and Kerrek SB answers i got this as "the most efficient proposal":

VectorXi hst(VectorXf& d,int& h){
    VectorXf e = d;
    VectorXi f(h); 
    int j=0;
    std::nth_element(d.data(),d.data()+h,d.data()+d.size());
    for(int i=0;i<d.size();i++){
        if(e(i)<=d(h)){
            f(j)=i;
            j++;
        if(j==h) break; 
        } 
    }
    return f;
}
user189035
  • 5,589
  • 13
  • 52
  • 112