How can I efficiently select a random element from a std::set?

A std::set::iterator is a bidirectional iterator, not a random access iterator, so I can't directly index a randomly chosen element the way I could for a std::deque or std::vector.

I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0,std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.

Is there a better approach?

In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".

Edit...

Many insightful answers below.

The short version is that even though you can find a specific element in O(log n) time, you can't reach an arbitrary position (the k-th element) in that time through the std::set interface.

Drew Dormann
  • 1
    Throw all iterators of the set into a `std::vector` and select randomly from that? – Xeo Sep 05 '12 at 19:33
  • 2
    Q: If you want to "access a random value", then you shouldn't be using a set in the first place, should you??? I like Xeo's suggestion: if you want to access the elements differently, then use a different accessor (e.g. "throw all the iterators into a std::vector"). – paulsm4 Sep 05 '12 at 19:36
  • @Xeo that's clever. You should post that. – Drew Dormann Sep 05 '12 at 19:39
  • @Xeo Would this be faster than incrementing through the list until the random index is reached? – Matt Phillips Sep 05 '12 at 19:56
  • @Matt: Depends on how often you need the random access, and how often the set changes. (And thanks to the randomness factor, you don't even need to keep the `std::vector` in order.) :) – Xeo Sep 05 '12 at 19:57
  • @Xeo Right, if the set doesn't change often this would be very efficient. – Matt Phillips Sep 05 '12 at 20:01
  • 1
    If the randomness is not very important, just use the first element in the set. – Willem Hengeveld Sep 10 '12 at 14:27
  • Possible duplicate of [How to select a random element in std::set?](http://stackoverflow.com/questions/3052788/how-to-select-a-random-element-in-stdset) – Ciro Santilli OurBigBook.com Feb 27 '17 at 10:02

7 Answers

8

Use boost::container::flat_set instead:

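#include <boost/container/flat_set.hpp>
#include <cstdlib>
boost::container::flat_set<int> set;
// ...
// flat_set iterators are random access; the set must be non-empty here.
auto it = set.begin() + rand() % set.size();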

Insertions and deletions become O(N), though; I don't know if that's a problem for you. Lookups are still O(log N), and the contiguous storage often gives an overall improvement that outweighs the loss of O(log N) insertions and deletions.

Benjamin Lindley
4

What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.

Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.

Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set copies the predicate object by value, you must declare the set's comparator type as a reference (RandomPredicate<int> &) so that you can reach in and modify the original object (switching it to "randomize" mode).

Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:

#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>

using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }
    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;

        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {
            size = 0;
            return false;
        } else {
            size /= 2;
            return r & 1;
        }
    }

    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));

    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> & > s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);

    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
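        // The iterator type must name the same comparator type as the set itself.
        set<int, RandomPredicate<int> & >::iterator it = s.lower_bound(0);
        // A run of "go right" decisions can leave lower_bound at end().
        if (it != s.end())
            cout << *it << endl;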
    }
}

My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.

Ben Jackson
  • 2
    You can't change the sort predicate for a set/map in your find/lower_bound call (unless you use `std::find` which is linear and asked to be avoided in the OP). – Mark B Sep 05 '12 at 19:52
  • 1
    @MarkB ah, I was adding the same thing in an edit when you were making your comment. I was hoping someone would know of a similar alternative... – Ben Jackson Sep 05 '12 at 19:54
  • I wonder if you could fool a set into behaving like it were a set with a different predicate. – Drew Dormann Sep 05 '12 at 20:13
  • This is very clever - the only thing I would suggest is to shift away say the lower eight bits of the random number before doing the test (in other words don't use bit 0 as your on/off random check). – Mark B Sep 06 '12 at 15:11
2

If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n) choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.

Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant overall. This compares favourably to std::next, which is O(n) time.
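For readers who can depend on GCC, the following is a sketch of the "random node in O(log n)" idea using libstdc++'s policy-based containers, which are an extension and not part of the standard. The tree stores subtree sizes, so find_by_order(k) reaches the k-th smallest element in O(log n). The alias IndexedSet and the helper random_element are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// GCC/libstdc++ extension, not standard C++: a red-black tree whose nodes
// also track subtree sizes, so the k-th smallest element is found in O(log n).
template <typename T>
using IndexedSet = __gnu_pbds::tree<
    T, __gnu_pbds::null_type, std::less<T>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

// Hypothetical helper: a uniformly chosen element of a non-empty IndexedSet.
template <typename T, typename Rng>
T random_element(const IndexedSet<T>& s, Rng& rng) {
    assert(!s.empty());
    std::uniform_int_distribution<std::size_t> pick(0, s.size() - 1);
    return *s.find_by_order(pick(rng));  // O(log n) lookup by rank
}
```

Because the index is drawn uniformly from [0, size), every element is equally likely, which a plain random left/right descent would only guarantee on a perfectly balanced tree.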

ecatmur
1

You can use the std::advance function:

set <int> myset;
//insert some elements into myset
int rnd = rand() % myset.size();
set <int> :: const_iterator it(myset.begin());
advance(it, rnd);
//now 'it' points to your random element
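The same linear-time walk can be sketched with the <random> facilities instead of rand() % size, which avoids modulo bias. The helper name random_element is hypothetical, and the cost is still O(n) because std::next steps through the set one node at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <set>

// Hypothetical helper (not from the answer): a uniformly chosen element of
// a non-empty std::set. Still O(n): std::next advances node by node.
template <typename T, typename Rng>
const T& random_element(const std::set<T>& s, Rng& rng) {
    assert(!s.empty());
    std::uniform_int_distribution<std::size_t> pick(0, s.size() - 1);
    return *std::next(s.begin(), static_cast<std::ptrdiff_t>(pick(rng)));
}
```

A caller would typically create one generator, for example std::mt19937 rng(std::random_device{}()), and reuse it for every call.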

Another way to do this, probably less random:

int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);
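
To see why the value-range approach is "less random", this small sketch enumerates every possible rnd and counts which element lower_bound selects. The function name lower_bound_hits is made up for the illustration, and it assumes the set's maximum is below INT_MAX so the loop counter cannot overflow.

```cpp
#include <cassert>
#include <map>
#include <set>

// For every value rnd in [min, max], record which element
// myset.lower_bound(rnd) selects. Each count is proportional to the
// probability of that element under the value-range approach.
std::map<int, int> lower_bound_hits(const std::set<int>& myset) {
    assert(!myset.empty());
    std::map<int, int> hits;
    const int mini = *myset.begin(), maxi = *myset.rbegin();
    for (int rnd = mini; rnd <= maxi; ++rnd)
        ++hits[*myset.lower_bound(rnd)];
    return hits;
}
```

For the set {0, 1, 100} this gives 1 hit each for 0 and 1 but 99 hits for 100: an element is chosen in proportion to the gap below it, so the result is only close to uniform when the values are evenly spaced.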
Chris
  • 8
    `std::advance` has the same performance characteristics of using the increment operator `rnd` times which is what the OP is trying to avoid. – IronMensan Sep 05 '12 at 19:51
  • @IronMensan True. Unfortunately I don't think one can avoid doing that other than by building your own balanced binary tree and then traversing it randomly. – Chris Sep 05 '12 at 19:53
  • @IronMensan I gave this another shot, check my new answer for reference, if you're interested. – Chris Sep 05 '12 at 20:15
1

If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.

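Another approach, as seen in a comment, is to keep a vector of iterators into the set and randomly select an iterator. For std::set, insertion invalidates no iterators and erasure invalidates only iterators to the erased element, so the vector only needs fixing up when elements are added or removed.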
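A minimal sketch of the iterator-vector idea, assuming it is acceptable to rebuild the vector lazily after any modification. The class name SampledSet and its members are hypothetical; the element type is fixed to int to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Hypothetical wrapper: a std::set plus a lazily rebuilt vector of iterators.
class SampledSet {
    std::set<int> data_;
    std::vector<std::set<int>::const_iterator> cache_;
    bool stale_ = true;  // true whenever cache_ no longer matches data_

public:
    void insert(int v) { if (data_.insert(v).second) stale_ = true; }
    void erase(int v)  { if (data_.erase(v) > 0) stale_ = true; }
    std::size_t size() const { return data_.size(); }

    // O(n) on the first call after a modification, O(1) otherwise.
    template <typename Rng>
    int sample(Rng& rng) {
        assert(!data_.empty());
        if (stale_) {
            cache_.clear();
            for (auto it = data_.begin(); it != data_.end(); ++it)
                cache_.push_back(it);
            stale_ = false;
        }
        std::uniform_int_distribution<std::size_t> pick(0, cache_.size() - 1);
        return *cache_[pick(rng)];
    }
};
```

This pays off when many samples are drawn between modifications; if the set changes before almost every sample, the rebuild makes it no better than walking the set.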

Finally if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
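The vector-as-set option might look like the sketch below, which appends freely and only sorts and removes duplicates when the contents are actually read. LazySortedSet is a hypothetical name, and the element type is fixed to int for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: a vector that is sorted and de-duplicated on demand.
class LazySortedSet {
    std::vector<int> items_;
    bool dirty_ = false;  // true when items_ may be unsorted or hold duplicates

    void normalize() {
        if (!dirty_) return;
        std::sort(items_.begin(), items_.end());
        items_.erase(std::unique(items_.begin(), items_.end()), items_.end());
        dirty_ = false;
    }

public:
    void insert(int v) { items_.push_back(v); dirty_ = true; }  // amortized O(1)

    bool contains(int v) {
        normalize();
        return std::binary_search(items_.begin(), items_.end(), v);
    }

    std::size_t size() { normalize(); return items_.size(); }

    // O(1) once normalized: random access into contiguous storage.
    template <typename Rng>
    int sample(Rng& rng) {
        normalize();
        assert(!items_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, items_.size() - 1);
        return items_[pick(rng)];
    }
};
```

The O(n log n) normalization is only worthwhile when insertions arrive in batches; alternating single insertions with lookups would sort on every read.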

Mark B
1

You can do this by maintaining a normal array of values alongside the set: when you insert into the set, you append the element to the end of the array (O(1)), and when you want a random element you can grab one from the array in O(1) as well.

The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:

Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i in the array. Keep a segment tree, where you keep the maximum prfx[i] contained in each range.

Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access a random element, you draw a random index among the non-deleted elements and query the segment tree to find its "real" index in the array (by finding the earliest range in which the maximum prfx is equal to the random index). This makes random selection O(log n).
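
As a rough sketch of the same idea, the code below swaps the segment tree over prfx[i] for a Fenwick (binary indexed) tree over 0/1 "alive" flags, which also finds the k-th non-deleted slot in O(log n). This is a substitution, not the structure described above; the class name LiveSlots, its members, and the fixed capacity are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: slots are handed out in order and never moved; a
// Fenwick tree counts live slots so the k-th live one is found in O(log n).
class LiveSlots {
    std::vector<int> tree_;    // 1-based Fenwick tree over 0/1 "alive" flags
    std::vector<bool> alive_;  // alive_[i] is true while slot i holds a value
    std::size_t used_ = 0;     // number of slots handed out so far
    std::size_t live_ = 0;     // number of slots currently alive

    void add(std::size_t pos, int delta) {  // pos is 1-based
        for (; pos < tree_.size(); pos += pos & (~pos + 1))
            tree_[pos] += delta;
    }

public:
    explicit LiveSlots(std::size_t capacity)
        : tree_(capacity + 1, 0), alive_(capacity, false) {}

    std::size_t live() const { return live_; }

    // Marks the next unused slot alive and returns its index.
    std::size_t append() {
        assert(used_ < alive_.size());
        alive_[used_] = true;
        add(used_ + 1, +1);
        ++live_;
        return used_++;
    }

    void erase(std::size_t slot) {
        assert(slot < used_ && alive_[slot]);
        alive_[slot] = false;
        add(slot + 1, -1);
        --live_;
    }

    // Index of the k-th live slot (k is 0-based), found by binary descent.
    std::size_t kth(std::size_t k) const {
        assert(k < live_);
        std::size_t pos = 0, step = 1;
        while (step * 2 < tree_.size()) step *= 2;
        int remaining = static_cast<int>(k) + 1;
        for (; step > 0; step /= 2) {
            const std::size_t next = pos + step;
            if (next < tree_.size() && tree_[next] < remaining) {
                pos = next;
                remaining -= tree_[next];
            }
        }
        return pos;  // largest prefix with fewer than k+1 live slots
    }
};
```

To sample, draw k uniformly from [0, live()) and read the value array at kth(k); the values themselves would be stored in a parallel array indexed by slot.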

Chris
0

Average O(1)/O(log N) (hashable/unhashable) insert/delete/sample with off-the-shelf containers

The idea is simple: use rejection sampling while bounding the rejection rate from above, which is achievable with an amortized O(1) compaction operation.

However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.

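#include <cstddef>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <utility>

template <typename T>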
class UniformSamplingSet {
    size_t max_id = 0;
    std::unordered_set<size_t> unused_ids;
    std::unordered_map<size_t, T> id2value;
    std::map<T, size_t> value2id;

    void compact() {
        size_t id = 0;
        std::map<T, size_t> new_value2id;
        std::unordered_map<size_t, T> new_id2value;
        for (auto [_, value] : id2value) {
            new_value2id.emplace(value, id);
            new_id2value.emplace(id, value);
            ++id;
        }
        max_id = id;
        unused_ids.clear();
        std::swap(id2value, new_id2value);
        std::swap(value2id, new_value2id);
    }

public:
    size_t size() {
        return id2value.size();
    }

    void insert(const T& value) {
        size_t id;
        if (!unused_ids.empty()) {
            id = *unused_ids.begin();
            unused_ids.erase(unused_ids.begin());
        } else {
            id = max_id++;
        }
        if (!value2id.emplace(value, id).second) {
            unused_ids.insert(id);
        } else {
            id2value.emplace(id, value);
        }
    }

    void erase(const T& value) {
        auto it = value2id.find(value);
        if (it == value2id.end()) return;
        unused_ids.insert(it->second);
        id2value.erase(it->second);
        value2id.erase(it);
        if (unused_ids.size() * 2 > max_id) {
            compact();
        }
    }

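    // uniform(n): must return a uniform random integer in [0, n).
    // Precondition: the set is non-empty; otherwise there is no valid id to return.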
    template <typename F>
    T sample(F&& uniform) {
        size_t i;
        do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
        return id2value.at(i);
    }
};
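
For comparison, when T is hashable and ordering is not needed, a different and simpler technique is the "swap with last, then pop" layout sketched below. It is not the rejection-sampling scheme above: values sit in a dense vector, so sampling never rejects. The class name SwapPopSet and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch for hashable T: values live in a dense vector and a
// hash map remembers each value's position. All operations are average O(1).
template <typename T>
class SwapPopSet {
    std::vector<T> items_;
    std::unordered_map<T, std::size_t> index_;

public:
    std::size_t size() const { return items_.size(); }

    bool insert(const T& v) {
        if (!index_.emplace(v, items_.size()).second) return false;  // duplicate
        items_.push_back(v);
        return true;
    }

    bool erase(const T& v) {
        auto it = index_.find(v);
        if (it == index_.end()) return false;
        const std::size_t hole = it->second;
        index_.erase(it);
        if (hole + 1 != items_.size()) {
            // Fill the hole with the last element and record its new position.
            items_[hole] = std::move(items_.back());
            index_[items_[hole]] = hole;
        }
        items_.pop_back();
        return true;
    }

    template <typename Rng>
    const T& sample(Rng& rng) const {
        assert(!items_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, items_.size() - 1);
        return items_[pick(rng)];
    }
};
```

The trade-off is that the container no longer provides sorted iteration, which the value2id std::map in the answer above still does.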
Huazuo Gao