-1

I have a vector that allows duplicates, and I want to randomly choose an element with a probability proportional to how many times that element is repeated.

For example, in the vector below, 6 should have the highest probability of being chosen. I thought about using rand(), but I am not quite sure how to incorporate the probability.

vector A = [ 0, 0, 2, 2, 4, 5, 1, 6, 6, 6] 

thanks

GAURANG VYAS
Kattie.S

2 Answers

-1

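I think you are on the right track for getting a custom distribution of values. Because each value appears in the vector as often as it should be weighted, picking a uniformly random index already gives the distribution you want. The following code demonstrates this access to the vector. Hope it helps.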

#include <cstdlib>
#include <iostream>
#include <ctime>
#include <vector>

int main()
{
    std::vector<int> A { 0, 0, 2, 2, 4, 5, 1, 6, 6, 6 };
    std::srand(static_cast<unsigned>(std::time(nullptr))); // use current time as seed for random generator
    std::size_t random_pos = std::rand() % A.size();  // modulo restricts the index to the range [0, A.size() - 1]
    int random_val = A[random_pos];                   // values that occur more often are picked more often
    std::cout << random_val << '\n';
}
Stephan Lechner
  • Thanks, this is what I was looking for. But looking at other questions, many people mentioned that using the modulo would not give a uniform distribution, so I am wondering whether it would be okay in the case of this vector? – Kattie.S Jun 15 '17 at 21:09
  • A random generator is expensive to construct. Construct it only once with `static`. Furthermore, make it `thread_local`; otherwise, multiple threads will likely get the same results. – user1587451 Jun 15 '17 at 21:15
  • `rand()` is a pseudo-random generator, which is sufficient for many cases but has shortcomings when it comes to cryptography, for example. The output of `rand()` is not guaranteed to be uniformly distributed, and so neither is a modulo on `rand()`. Yet you control the "effective" distribution much more through the content of the vector you pre-fill. So I'd not care too much about the modulo: any deviation from a uniform distribution introduced by `rand() % 10` (in your case) will have very little influence compared to the distribution of values in the vector – Stephan Lechner Jun 15 '17 at 21:16
  • @user1587451: generating random numbers, possibly within several threads, is a special topic; I think it goes beyond what the OP asked in the question. Whether std::rand() is thread-safe is implementation-defined; yet I think one first has to define the multithreading setting. – Stephan Lechner Jun 15 '17 at 21:22
  • @StephanLechner you are right, the OP did not ask for that, but with so many cores out there, I presumed the OP wanted an MT solution. – user1587451 Jun 15 '17 at 21:28
  • @StephanLechner: although not explicitly stated in the standard, `rand()` is uniform in every implementation I saw; the problem is that *the modulo* isn't going to be uniform, especially with larger modulo values. – Matteo Italia Jun 15 '17 at 22:22
  • @MatteoItalia In case anyone is curious why this is true, https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator this explains it really well. – ozeanix Jun 15 '17 at 22:52
  • @MatteoItalia in my case, it will be okay, right? How big are the values you are referring to? – Kattie.S Jun 16 '17 at 07:01
  • The thing starts to become a problem once `residuals = N % modulo` becomes some significant fraction of `N`, where `N = RAND_MAX + 1` is the number of possible outputs of `rand()`; in general, unless `residuals == 0` (where the distribution remains uniform), if we call `buckets = N / modulo` (`/` here is integer division), every number has probability `float(buckets) / N` of being extracted, except numbers in [0, residuals), which have `float(buckets + 1) / N`. – Matteo Italia Jun 16 '17 at 07:48
  • Say `RAND_MAX` is 32767 (as in VC++), so `N = 32768`, and `modulo` is 20000; here you'll have `buckets = 1` and `residuals = 12768`, so P([0, residuals)) = 2. / N ≈ 6.1E-5, while P([residuals, modulo)) = 1. / N ≈ 3.1E-5. So, numbers between 0 and 12767 have twice the probability to be extracted. OTOH, on small numbers the difference is negligible; in your case (modulo = 10, the size of your vector) you'd have `residuals = 8`, `buckets = 3276`; P([0, 8)) = (3276. + 1.) / 32768 ≈ 0.100006, while P([8, 10)) = 3276. / 32768 ≈ 0.099976. – Matteo Italia Jun 16 '17 at 07:52
  • To sum it up, the key metric here is `(buckets + 1.) / buckets`; if this is significantly larger than 1 (say, greater than 1.2), you have a problem. – Matteo Italia Jun 16 '17 at 07:59
-1

Maybe something like this (untested!):

#include <vector>
#include <random>
#include <iostream>

int main()
{
  std::vector<size_t> A{0, 0, 2, 2, 4, 5, 1, 6, 6, 6};

  static thread_local std::mt19937 g{std::random_device{}()};

  // uniform_int_distribution is inclusive on both ends, so the upper bound must be size() - 1
  static thread_local std::uniform_int_distribution<size_t> d{0, A.size() - 1};

  std::cout << A[d(g)] << std::endl;
}
user1587451