
I'd like to sample from a discrete distribution without replacement (i.e., without repetition).

With the function std::discrete_distribution, it is possible to sample with replacement. Using this function, I implemented sampling without replacement in a very rough way:

#include <iostream>
#include <random>
#include <vector>
#include <array>

int main()
{
    const int sampleSize = 8;   // Size of the sample
    std::vector<double> weights = {2,2,1,1,2,2,1,1,2,2}; // 10 possible outcomes with different weights

    std::random_device rd;
    std::mt19937 generator(rd());

    /// WITH REPLACEMENT

    std::discrete_distribution<int> distribution(weights.begin(), weights.end()); 

    std::array<int, 10> p ={};
    for(int i=0; i<sampleSize; ++i){
        int number = distribution(generator);
        ++p[number];
    }

    std::cout << "Discrete_distribution with replacement:" << std::endl;
    for (int i=0; i<10; ++i)
        std::cout << i << ": " << std::string(p[i],'*') << std::endl;


    /// WITHOUT REPLACEMENT

    p = {};
    for(int i=0; i<sampleSize; ++i){
        std::discrete_distribution<int> distribution(weights.begin(), weights.end()); 
        int number = distribution(generator);
        weights[number] = 0; // the weight associated with the sampled value is set to 0
        ++p[number];
    }

    std::cout << "Discrete_distribution without replacement:" << std::endl;
    for (int i=0; i<10; ++i)
        std::cout << i << ": " << std::string(p[i],'*') << std::endl;


    return 0;
}

Have you ever coded such a sampling without replacement, perhaps in a more optimized way?

Thank you.

Cheers,

T.A.

  • I think this article https://arxiv.org/abs/1603.06556 might be helpful. At the bottom there is an interesting algorithm for sampling from such distributions. But I didn't find any library providing your desired function, at least no C++ library. – Aleph0 Dec 05 '18 at 13:13
  • Note that C++ now has std::sample in the <algorithm> header (see the sketch below). – doctorlove Mar 16 '23 at 11:04
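
For the unweighted case mentioned in that comment, here is a minimal sketch of std::sample (C++17). Note that it draws uniformly without replacement and does not take weights, so it does not cover the weighted sampling the question asks about:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

int main()
{
    std::vector<int> population = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<int> sample;
    std::mt19937 gen(std::random_device{}());

    // Draw 8 distinct elements uniformly at random (unweighted).
    std::sample(population.begin(), population.end(),
                std::back_inserter(sample), 8, gen);

    for (int x : sample)
        std::cout << x << " ";
    std::cout << std::endl;
    return 0;
}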

3 Answers


This solution might be a bit shorter. Unfortunately, it needs to create a discrete_distribution<> object in every step, which might be prohibitive when drawing a lot of samples.

#include <iostream>
#include <vector>
#include <boost/random/discrete_distribution.hpp>
#include <boost/random/mersenne_twister.hpp>

using namespace boost::random;

int main(int, char**) {
    std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 2 };
    discrete_distribution<> dist(w);
    int n = 10;
    boost::random::mt19937 gen;
    std::vector<int> samples;
    for (auto i = 0; i < n; i++) {
        samples.push_back(dist(gen));
        w[*samples.rbegin()] = 0;
        dist = discrete_distribution<>(w);
    }
    for (auto iter : samples) {
        std::cout << iter << " ";
    }

    return 0;
}

Improved answer:

After carefully looking for a similar question on this site (Faster weighted sampling without replacement), I found a stunningly simple algorithm for weighted sampling without replacement; it is just a bit involved to implement in C++. The idea is to draw a uniform random number u_i in (0,1) for each item i, compute the key u_i^(1/w_i), and keep the items with the largest keys. Note that this is not the most efficient algorithm, but it seems to me the simplest one to implement.

In https://doi.org/10.1016/j.ipl.2005.11.003 the method is described in detail.

In particular, it is not efficient if the sample size is much smaller than the population, since a key is computed and sorted for every element of the population.

#include <iostream>
#include <iterator>
#include <algorithm>
#include <cmath>
#include <vector>
#include <boost/random/uniform_01.hpp>
#include <boost/random/mersenne_twister.hpp>

using namespace boost::random;

int main(int, char**) {
    std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 10 };
    uniform_01<> dist;
    boost::random::mt19937 gen;
    std::vector<double> vals;
    std::generate_n(std::back_inserter(vals), w.size(), [&dist,&gen]() { return dist(gen); });
    std::transform(vals.begin(), vals.end(), w.begin(), vals.begin(), [&](auto r, auto w) { return std::pow(r, 1. / w); });
    std::vector<std::pair<double, int>> valIndices;
    size_t index = 0;
    std::transform(vals.begin(), vals.end(), std::back_inserter(valIndices), [&index](auto v) { return std::pair<double,size_t>(v,index++); });
    std::sort(valIndices.begin(), valIndices.end(), [](auto x, auto y) { return x.first > y.first; });
    std::vector<int> samples;
    // All indices, sorted by key; take the first k entries for a sample of size k.
    std::transform(valIndices.begin(), valIndices.end(), std::back_inserter(samples), [](auto v) { return v.second; });

    for (auto iter : samples) {
        std::cout << iter << " ";
    }

    return 0;
}

Easier answer

I just removed some of the STL functions and replaced them with simple for loops.

#include <iostream>
#include <iterator>
#include <vector>
#include <cmath>
#include <algorithm>
#include <boost/random/uniform_01.hpp>
#include <boost/random/mersenne_twister.hpp>

using namespace boost::random;

int main(int, char**) {
    std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 1000 };
    uniform_01<> dist;
    boost::random::mt19937 gen(342575235);
    std::vector<double> vals;
    for (auto iter : w) {
        vals.push_back(std::pow(dist(gen), 1. / iter));
    }
    // Sorting vals, but retain the indices. 
    // There is unfortunately no easy way to do this with STL.
    std::vector<std::pair<int, double>> valsWithIndices;
    for (size_t iter = 0; iter < vals.size(); iter++) {
        valsWithIndices.emplace_back(iter, vals[iter]);
    }
    std::sort(valsWithIndices.begin(), valsWithIndices.end(), [](auto x, auto y) {return x.second > y.second; });

    std::vector<size_t> samples;
    int sampleSize = 8;
    for (auto iter = 0; iter < sampleSize; iter++) {
        samples.push_back(valsWithIndices[iter].first);
    }
    for (auto iter : samples) {
        std::cout << iter << " ";
    }

    return 0;
}
Aleph0
  • Thank you for the suggestion. I was wondering if there was a way to avoid creating the discrete_distribution at each step. But it may not be that easy... – T.A. Dec 05 '18 at 15:25
  • Great. I will have a look at it. Thanks a lot! – T.A. Dec 06 '18 at 17:50
  • Thank you for the edit. I tried the code, but it does not sample a subset of the original vector. If I write the last element of the vector w as '2', I do not get a sample of 2 elements. I must say that the code is very complicated for a newbie like me. Could you help me with that? Cheers – T.A. Dec 06 '18 at 22:30
  • Ok, I just figured out that I have to take the first elements of the vector samples. Is that it? – T.A. Dec 06 '18 at 22:39
  • You are right. If you change, for example, the last weight to a high value, say 100, then you have a high chance that sample 9 is the first in the list. Maybe the code would be more readable if I removed the STL functions? – Aleph0 Dec 07 '18 at 06:09
  • It is very clear. And it does the job very well! Thank you for your help. – T.A. Dec 07 '18 at 08:45

The existing answer by Aleph0 works the best of the ones I tested. I benchmarked the original solution from the question, the one added by Aleph0, and a new one that only builds a new discrete_distribution once more than 50% of the items covered by the current distribution have already been added to the sample (redrawing whenever the distribution produces an item that is already in the sample).

I tested with sample size == population size, and weights equal to the index. I think the original solution in the question runs in O(n^2), my new one in O(n log n), and the one from the paper seems to run in O(n).

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Reuse             25252721 ns     25251731 ns           26
BM_NewDistribution   17338706125 ns  17313620000 ns         1
BM_SomePaper         6789525 ns      6779400 ns           100

Code:

#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_01.hpp>
#include <cmath>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

const int sampleSize = 20000;

using namespace boost::random;

static void BM_ReuseDistribution(benchmark::State &state) {
  std::vector<double> weights;
  weights.resize(sampleSize);

  for (auto _ : state) {
    for (int i = 0; i < sampleSize; i++) {
      weights[i] = i + 1;
    }
    std::random_device rd;
    std::mt19937 generator(rd());
    int o[sampleSize];
    std::discrete_distribution<int> distribution(weights.begin(),
                                                 weights.end());
    int numAdded = 0;
    int distSize = sampleSize;
    for (int i = 0; i < sampleSize; ++i) {
      if (numAdded > distSize / 2) {
        distSize -= numAdded;
        numAdded = 0;
        distribution =
            std::discrete_distribution<int>(weights.begin(), weights.end());
      }

      int number = distribution(generator);
      if (!weights[number]) {
        i -= 1;
        continue;
      } else {
        weights[number] = 0;
        o[i] = number;
        numAdded += 1;
      }
    }
  }
}

BENCHMARK(BM_ReuseDistribution);

static void BM_NewDistribution(benchmark::State &state) {
  std::vector<double> weights;
  weights.resize(sampleSize);

  for (auto _ : state) {
    for (int i = 0; i < sampleSize; i++) {
      weights[i] = i + 1;
    }
    std::random_device rd;
    std::mt19937 generator(rd());
    int o[sampleSize];

    for (int i = 0; i < sampleSize; ++i) {
      std::discrete_distribution<int> distribution(weights.begin(),
                                                   weights.end());
      int number = distribution(generator);
      weights[number] = 0;
      o[i] = number;
    }
  }
}

BENCHMARK(BM_NewDistribution);

static void BM_SomePaper(benchmark::State &state) {
  std::vector<double> w;
  w.resize(sampleSize);
  for (auto _ : state) {
    for (int i = 0; i < sampleSize; i++) {
      w[i] = i + 1;
    }

    uniform_01<> dist;
    boost::random::mt19937 gen;
    std::vector<double> vals;
    std::generate_n(std::back_inserter(vals), w.size(),
                    [&dist, &gen]() { return dist(gen); });
    std::transform(vals.begin(), vals.end(), w.begin(), vals.begin(),
                   [&](auto r, auto w) { return std::pow(r, 1. / w); });
    std::vector<std::pair<double, int>> valIndices;
    size_t index = 0;
    std::transform(
        vals.begin(), vals.end(), std::back_inserter(valIndices),
        [&index](auto v) { return std::pair<double, size_t>(v, index++); });
    std::sort(valIndices.begin(), valIndices.end(),
              [](auto x, auto y) { return x.first > y.first; });
    std::vector<int> samples;
    std::transform(valIndices.begin(), valIndices.end(),
                   std::back_inserter(samples),
                   [](auto v) { return v.second; });
  }
}

BENCHMARK(BM_SomePaper);

BENCHMARK_MAIN();
RyanCheu

Thanks for your question and the others' nice answers; I ran into the same question as you. I think you needn't create a new distribution every time; instead, you can re-parameterize the existing one:

dist.param({ wts.begin(), wts.end() });
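
For illustration, here is a minimal sketch of that idea (assuming the same 10 weights and sample size as in the question), reusing one std::discrete_distribution object and only resetting its parameters after each draw:

#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::vector<double> wts = {2, 2, 1, 1, 2, 2, 1, 1, 2, 2};
    const int sampleSize = 8;

    std::mt19937 gen(std::random_device{}());
    std::discrete_distribution<int> dist(wts.begin(), wts.end());

    std::vector<int> sample;
    for (int i = 0; i < sampleSize; ++i) {
        int index = dist(gen);
        sample.push_back(index);
        wts[index] = 0;                         // exclude the drawn index from further draws
        dist.param({ wts.begin(), wts.end() }); // reuse the same distribution object
    }

    for (int index : sample)
        std::cout << index << " ";
    std::cout << std::endl;
    return 0;
}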

The complete code is as follows:

// Improved STL-based approach
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <map>
#include <set>
#include <iterator>
#include <algorithm>

int main()
{
    // Default random engine
    std::default_random_engine rng;

    // Engine seeded with device entropy to ensure randomness
    auto gen = std::mt19937{ std::random_device{}() };

    std::vector<int> wts(24);   // stores the weights

    std::vector<int> in(24);    // stores the population

    std::set<int> out;          // stores the sampling result

    std::map<int, int> count;   // counts for the output

    int sampleCount = 0;        // number of draws

    int index = 0;              // drawn index

    int sampleSize = 24;        // number of samples to draw

    int sampleTimes = 100000;   // number of test draws

    // Assign the weights
    for (int i = 0; i < 24; i++)
    {
        wts.at(i) = 48 - 2 * i;
    }

    // Fill the population and print it
    std::cout << "The population consists of 24 elements:" << std::endl;

    for (int i = 0; i < 24; i++)
    {
        in.at(i) = i + 1;

        std::cout << in.at(i) << " ";
    }

    std::cout << std::endl;

    // Create the discrete distribution from the given weights
    std::discrete_distribution<size_t> dist{ wts.begin(), wts.end() };

    auto probs = dist.probabilities(); // the computed probabilities

    // Print the computed probabilities
    std::cout << "The weights of the population elements are:" << std::endl;

    std::copy(probs.begin(), probs.end(),
              std::ostream_iterator<double>{ std::cout << std::fixed << std::setprecision(5), " " });

    std::cout << std::endl << std::endl;

    // ========== sampling test ==========
    for (size_t j = 0; j < sampleTimes; j++)
    {
        index = dist(gen);

        //std::cout << index << " ";  // print the drawn index

        count[index] += 1;  // count the drawn index
    }

    double sum = 0.0;  // used to sum up the frequencies

    // Print the test results
    std::cout << "Out of " << sampleTimes << " draws in total, the counts and frequencies per index are:" << std::endl;

    for (size_t i = 0; i < 24; i++)
    {
        std::cout << i << ": count = " << count[i] << "   frequency = " << count[i] / double(sampleTimes) << std::endl;

        sum += count[i] / double(sampleTimes);
    }

    std::cout << "Total frequency: " << sum << std::endl << std::endl;
    // ========== sampling test ==========

    // Draw from the population into the set until the set reaches the sample size
    while (out.size() < sampleSize - 1)
    {
        index = dist(gen);  // draw an index

        out.insert(index);  // insert it into the set

        sampleCount += 1;   // increment the draw count

        wts.at(index) = 0;  // set the weight of the drawn index to 0

        dist.param({ wts.begin(), wts.end() });  // reuse the distribution with the updated weights

        probs = dist.probabilities(); // the updated probabilities

        // Print the updated probabilities
        std::cout << "The weights of the population elements are:" << std::endl;

        std::copy(probs.begin(), probs.end(),
                  std::ostream_iterator<double>{ std::cout << std::fixed << std::setprecision(5), " " });

        std::cout << std::endl << std::endl;
    }
    // The last draw is done separately to avoid assigning an all-zero weight
    // vector to the discrete distribution dist, which would be an error.
    index = dist(gen);  // draw an index

    out.insert(index);  // insert it into the set

    sampleCount += 1;   // increment the draw count

    // Print the sampling result
    std::cout << "The indices of the " << sampleSize << " samples drawn from the population are:" << std::endl;

    for (auto iter : out)
    {
        std::cout << iter << "-";
    }

    std::cout << std::endl;

    // Print the number of draws
    std::cout << "Number of draws: " << sampleCount << std::endl;

    out.clear(); // clear the output set to prepare for the next sampling

    std::cin.get(); // keep the console window open
    return 0;
}