3

I have code that creates several object instances (each instance having a fitness value, among other things) from which I want to sample N unique objects using weighted selection based on their fitness values. All objects not sampled are then discarded (but they need to be initially created to determine their fitness value).

my current code looks something like this:

vector<Item> getItems(..) {
    std::vector<Item> items
    .. // generate N values for items
    int N = items.size();

    std::vector<double> fitnessVals;
    for(auto it = items.begin(); it != items.end(); ++it)
        fitnessVals.push_back(it->getFitness());

    std::mt19937& rng = getRng();
    for(int i = 0, i < N, ++i) {
        std::discrete_distribution<int> dist(fitnessVals.begin() + i, fitnessVals.end());
        unsigned int pick = dist(rng);
        std::swap(fitnessVals.at(i), fitnessVals.at(pick));
        std::swap(items.at(i), items.at(pick));
    }
    items.erase(items.begin() + N, items.end());
    return items;
}

Typically ~10,000 instances are initially created, with N being ~200. The fitness value is non-negative, usually valued at ~70. It could go as high as ~3000, but higher values are increasingly more unlikely.

Is there an elegant way to get rid of the fitnessVals vector? Or perhaps a better way to do this in general? Efficiency is important, but I'm also wondering about good C++ coding practices.

Sveltely
  • 822
  • 1
  • 8
  • 22
  • Your usage of `discrete_distribution` is incorrect; when you want to generate several items following a given distribution, all items need to use the same distribution **object**. Therefore, the distribution need be created outside the loop. Of course, I realize it is not satisfactory in your case, but maybe that just hints at a bad choice of distribution, or maybe there is another way to use it. – Matthieu M. Mar 12 '14 at 09:02
  • 1
    I found something regarding your randomness issue: [Select random k elements from a list whose elements have weight](http://stackoverflow.com/q/2140787/147192), I believe it means you should create your own distribution to implement the first answer `n_random_numbers_decreasing` function. – Matthieu M. Mar 12 '14 at 09:20

3 Answers3

3

If you're asking whether you can do this just with the items in your items vector, the answer is yes. The following is a rather hideous but none-the-less effective way to do that: I apologize in advance for the density.

This wraps the unsuspecting container iterator in another iterator of our own devices; one that pairs it with a member function of your choice. You may have to dance with const in this to get it to work correctly with your member function choice. That task i leave to you.

template<typename Iter, typename R>
struct memfn_iterator_s :
    public std::iterator<std::input_iterator_tag, R>
{
    using value_type = typename std::iterator_traits<Iter>::value_type;

    memfn_iterator_s(Iter it, R(value_type::*fn)())
        : m_it(it), mem_fn(fn) {}

    R operator*()
    {
        return ((*m_it).*mem_fn)();
    }

    bool operator ==(const memfn_iterator_s& arg) const
    {
        return m_it == arg.m_it;
    }

    bool operator !=(const memfn_iterator_s& arg) const
    {
        return m_it != arg.m_it;
    }

    memfn_iterator_s& operator ++() { ++m_it; return *this; }

private:
    R (value_type::*mem_fn)();
    Iter m_it;
};

A generator function follows to create the above monstrosity:

template<typename Iter, typename R>
memfn_iterator_s<Iter,R> memfn_iterator(
    Iter it, 
    R (std::iterator_traits<Iter>::value_type::*fn)())
{
    return memfn_iterator_s<Iter,R>(it, fn);
}

What this buys you is the ability to do this:

auto it_end = memfn_iterator(items.end(), &Item::getFitness);
for(unsigned int i = 0; i < N; ++i)
{
    auto it_begin = memfn_iterator(items.begin()+i, &Item::getFitness);
    std::discrete_distribution<unsigned int> dist(it_begin, it_end);
    std::swap(items.at(i), items.at(i+dist(rng)));
}
items.erase(items.begin() + N, items.end());

No temporary array is required. The member function is called for the respective item when required by the discrete distribution (which usually keeps it own vector of weights, and as such replicating that effort would be redundant).

Dunno if you'll get anything helpful or useful out of that, but it was fun to think about.

WhozCraig
  • 65,258
  • 11
  • 75
  • 141
  • I think there is an error here: a distribution has a state, and thus should NOT be created again and again in the loop. See the example on [cppreference](http://en.cppreference.com/w/cpp/numeric/random/discrete_distribution): the distribution is outside the loop. The other thing that I wonder about is the `swap`: it is possible that `dist(rng)` will produce an index `[0, i]`, in which case your swap is weird. – Matthieu M. Mar 12 '14 at 08:03
  • @MatthieuM. I understand a discrete distribution has state. So does the OP. Once a weighted choice is made, it is removed from the collection (swapped with incrementing iterator starting at the beginning). That item and its weight are no longer used in the followup weighted choices to come. Thus a new distribution based on the *remaining* items is used for each choice, unclouded by the already-chosen item's weight. Why do this? The OP wants to. Regarding using `diet(rng)` I think your right. (though your math is off, I think its [0..items.size()-i]), and thus i should be added. nice catch. – WhozCraig Mar 12 '14 at 08:16
  • I understand the reason you (and the OP) create a new distribution each time, but my point is that a `std::*_distribution` object respects the distribution it is named after *only if used throughout* the choice*. I also understand it is not your fault, since this is already what the OP did. – Matthieu M. Mar 12 '14 at 08:58
  • @MatthieuM. I completely concur with you. it seemed odd to me as well. Hopefully at least the elimination of the temp-array was beneficial none-the-less. – WhozCraig Mar 12 '14 at 09:03
  • Yes, I upvoted you for this reason; your answer is spot on with regard to the formulated question, the distribution issue can be solved elsewhere. – Matthieu M. Mar 12 '14 at 09:04
  • @MatthieuM. Danka, and thanks again for catching the bogus index on the rhs of that swap. – WhozCraig Mar 12 '14 at 09:07
  • Upvote for the iterator implementation. Couldn't come up with that at 3am this morning, and since I have not written any custom iterators, even less chance of it happening. I mentioned in a comment to my response the Boost iterator library - would that also be a viable solution to the OP's problem, using something similar to your answer? – Will Mar 12 '14 at 13:16
  • That's disgustingly beautiful (in a good way.. I think.). Also, @MatthieuM., discrete_distribution [produces values on the interval [0, n)](http://en.cppreference.com/w/cpp/numeric/random/discrete_distribution). Good catch on distribution having a state though, I totally missed that – Sveltely Mar 13 '14 at 01:32
2

It's pretty nice that they have a discrete distribution in STL. As far as I know, the most efficient algorithm for sampling from a set of weighted objects (i.e., with probability proportional to weights) is the alias method. There's a Java implementation here: http://www.keithschwarz.com/interesting/code/?dir=alias-method

I suspect that's what the STL discrete_distribution uses anyway. If you're going to be calling your getItems function frequently, you might want to create a "FitnessSet" class or something so that you don't have to build your distribution every time you want to sample from the same set.

EDIT: Another suggestion... If you want to be able to delete items, you could instead store your objects in a binary tree. Each node would contain the sum of the weights in the subtree beneath it, and the objects themselves could be in the leaves. You could select an object through a series of log(N) coin tosses: at a given node, choose a random number between 0 and node.subtreeweight. If it's less than node.left.subtreeweight, go left; otherwise go right. Continue recursively until you reach a leaf.

rainbowgoblin
  • 1,221
  • 12
  • 28
  • After I select an item, I recreate the distribution again so that it does not include the selected item... resulting in `N` recreations. If I used the alias method, wouldn't I still have to recreate the [alias structure] after every sample (as `N` changes)? – Sveltely Mar 12 '14 at 06:44
1

I would try something like the following (see code comments):

#include <algorithm>                 // For std::swap and std::transform
#include <functional>                // For std::mem_fun_ref
#include <random>                    // For std::discrete_distribution
#include <vector>                    // For std::vector
size_t
get_items(std::vector<Item>& results, const std::vector<Item>& items)
{
    // Copy the items to the results vector. All operations should be
    // done on it, rather than the original items vector.
    results.assign(items.begin(), items.end());

    // Create the fitness values vector, immediately allocating
    // the number of doubles required to match the size of the
    // input item vector.
    std::vector<double> fitness_vals(results.size());

    // Use some STL "magic" ...
    // This will iterate over the items vector, calling the
    // getFitness() method on each item, and storing the result
    // in the fitness_vals vector.
    std::transform(results.begin(), results.end(),
                   fitness_vals.begin(),
                   std::mem_fun_ref(&Item::getFitness));

    //
    std::mt19937& rng = getRng();
    for (size_t i=0; i < results.size(); ++i) {
        std::discrete_distribution<int> dist(fitness_vals.begin() + i, fitness_vals.end());
        unsigned int pick = dist(rng);
        std::swap(fitness_vals[ii], fitness_vals[pick]);
        std::swap(results[i], results[pick]);
    }

    return (results.size());
}

Instead of returning the results vector, the caller provides a vector into which the results should be added. Also, the original vector (passed as the second parameter) remains unchanged. If this is not something that concerns you, you can always pass just the one vector and work with it directly.

I don't see a way to not have the fitness values vector; the discrete_distribution constructor needs to have the begin and end iterators, so from what I can tell, you will need to have this vector.

The rest of it is basically the same, with the return value being the number of items in the result vector, rather than the vector itself.

This example makes use of a number of STL features (algorithms, containers, functors) which I have found to be useful and part of my day-to-day development.


Edit: the call to items.erase() is superfluous; items.begin() + N where N == items.size() is equivalent to items.end(). The call to items.erase() would equate to a no-op.

Will
  • 3,500
  • 4
  • 30
  • 38
  • By assigning `results` to be from `items.begin()` to `items.end()`, aren't you causing it to set `N = items.size()`, resulting in the function just shuffling the elements of items? STL magic sounds cool though. I was hoping there was some way to use said magic to create an `vector::iterator` that returns the result of `getFitness()` when dereferenced. Or whatever the correct syntax would be. – Sveltely Mar 12 '14 at 06:58
  • `results.assign()` is copying the contents of the `items` vector into the `results` vector. This may or may not be what you want to do, but it is common to see something like this when the original set of data is to be left unmodified; make a copy and work on the copy. In theory, I believe that, yes, you could write an iterator to do what you want, but I'm not terribly sure how to do it; a bit beyond my level of expertise at the moment. – Will Mar 12 '14 at 13:02
  • Boost provides an [iterator](http://www.boost.org/doc/libs/1_55_0/libs/iterator/) library that is meant for writing iterator classes, but I am not familiar with it or how to use it. The documentation for most of the Boost libraries is generally very good and provides examples; I would start there if you want to go the route of a custom iterator. – Will Mar 12 '14 at 13:07