
I have a data vector A of length 1 million (indexes 0 to 1 million). From A, I want to create a vector B (whose length is, let's say, just 10% of A's) containing indexes of A, taken as a random sample. I tried using srand() and random_shuffle; is this a good way to extract samples from very large vectors? Can anyone suggest an approach?

std::vector<int> samplingIndex;

for (int i = 0; i < 1000000; ++i) { samplingIndex.push_back(i); }
std::srand(50);
std::random_shuffle(samplingIndex.begin(), samplingIndex.end());

After this I take the first 10% of the indexes from samplingIndex to make B.

Hum

4 Answers


You may use a Fisher–Yates shuffle and avoid constructing the huge index array entirely:

Something like:

// Partial Fisher–Yates shuffle: returns `size` values drawn uniformly,
// without replacement, from [0, max_size), without materializing the
// full index array.
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

std::vector<int> FisherYatesShuffle(std::size_t size,
                                    std::size_t max_size,
                                    std::mt19937& gen)
{
    assert(size <= max_size);
    std::vector<int> res(size);

    // Virtually shuffle 0..max_size-1, but only keep track of the
    // first `size` slots of the would-be shuffled array.
    for (std::size_t i = 0; i != max_size; ++i) {
        std::uniform_int_distribution<std::size_t> dis(0, i);
        std::size_t j = dis(gen);
        if (j < res.size()) {
            if (i < res.size()) {
                res[i] = res[j];
            }
            res[j] = static_cast<int>(i);
        }
    }
    return res;
}
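For instance, drawing the question's 10% sample could look like this (the seed value is only an example):

std::mt19937 gen(50);
std::vector<int> b = FisherYatesShuffle(100000, 1000000, gen);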


Jarod42

Seems reasonable. One tweak: you could replace your for loop with the following to avoid repeated reallocation of the vector:

#include <numeric>  // for std::iota

std::vector<int> samplingIndex(1000000);
std::iota(samplingIndex.begin(), samplingIndex.end(), 0);

If your take-percentage is much smaller than 10%, it would be worthwhile to just generate random numbers in [0, len(A)) until you get len(B) distinct values.
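A minimal sketch of that rejection idea, assuming a std::mt19937 generator (the function and variable names are illustrative, not from the answer):

#include <cstddef>
#include <random>
#include <unordered_set>

// Keeps drawing until k distinct indexes in [0, n) have been collected.
// Efficient when k is a small fraction of n, since collisions are rare.
std::unordered_set<std::size_t> pickDistinctIndexes(std::size_t n, std::size_t k,
                                                    std::mt19937& gen)
{
    std::uniform_int_distribution<std::size_t> dis(0, n - 1);
    std::unordered_set<std::size_t> chosen;
    while (chosen.size() < k) {
        chosen.insert(dis(gen));  // duplicates are silently ignored by the set
    }
    return chosen;
}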

John Zwinck
  • Thanks @John. Is random_shuffle a good (uniform) sampler? For example, I want to observe the number of bit errors in two huge vectors (length 1M) by comparing them bit by bit. The comparison should use less than 10% of the bits, and those bits should represent the error tendency of the whole vector, so the 10% of bits extracted via random_shuffle should be uniform; i.e. the error percentage obtained from the extracted 10% and from 20% should be more or less the same. – Hum Sep 16 '14 at 09:01
  • @Hum See my comment above. `random_shuffle` itself uses an unbiased algorithm but the random number generator it uses is biased. Use `std::shuffle` for better results. – Konrad Rudolph Sep 16 '14 at 09:06
  • @John: not distinct values but distinct indexes... it's very different – Gianluca Ghettini Sep 16 '14 at 09:33

Your code is written using old C++. I think you should look closely at the random facilities in C++11/14 (std::random_shuffle was deprecated in C++14 and removed in C++17; std::shuffle with a <random> engine is the modern replacement).

http://en.cppreference.com/w/cpp/algorithm/random_shuffle
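For example, the question's shuffle could be rewritten with the C++11 facilities roughly like this (the seed 50 is taken from the question):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    std::vector<int> samplingIndex(1000000);
    std::iota(samplingIndex.begin(), samplingIndex.end(), 0);

    std::mt19937 gen(50);  // seeded engine replaces std::srand
    std::shuffle(samplingIndex.begin(), samplingIndex.end(), gen);
    // the first 10% of samplingIndex is now a uniform random sample
}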

Dakorn

If your input is from an AWGN source (or close to it), you can just pick one sample every ten samples and do the job in O(N) time (you want 10% of random samples, right?).
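A minimal sketch of that deterministic decimation (the helper name is illustrative):

#include <vector>

std::vector<int> everyTenth(int n)
{
    std::vector<int> b;
    b.reserve(n / 10);
    for (int i = 0; i < n; i += 10) {
        b.push_back(i);  // every 10th index: O(N), no PRNG needed
    }
    return b;
}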

Otherwise, a very efficient way to extract 10% of random samples from a huge vector is to pick samples at random, storing each selected index. Keep picking random items and repeat the draw if the index was already taken. Yes, it is a probabilistic approach, but you achieve O(N) complexity in the best and average case. The worst case is that you keep selecting the same index again and again, but that would mean a very, very bad PRNG implementation; you can assume the worst case is extremely unlikely (just keep the odds sufficiently low, as with a hash function).
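A sketch of this probabilistic approach, assuming a std::mt19937 generator (the names are illustrative):

#include <cstddef>
#include <random>
#include <vector>

std::vector<std::size_t> pickRandomSamples(std::size_t n, std::size_t k,
                                           std::mt19937& gen)
{
    std::uniform_int_distribution<std::size_t> dis(0, n - 1);
    std::vector<bool> taken(n, false);   // remembers already-selected indexes
    std::vector<std::size_t> picked;
    picked.reserve(k);
    while (picked.size() < k) {
        std::size_t idx = dis(gen);
        if (!taken[idx]) {               // repeat the draw if already taken
            taken[idx] = true;
            picked.push_back(idx);
        }
    }
    return picked;
}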

You could also use a linked list and "short-circuit" the selected samples (reducing the PRNG output space to N-1, then N-2, and so on), but this requires extra memory to store the linked list.
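A sketch of that linked-list variant (note that advancing a std::list iterator is itself O(N), so this trades traversal time for the shrinking output space):

#include <cstddef>
#include <iterator>
#include <list>
#include <numeric>
#include <random>
#include <vector>

std::vector<int> sampleViaList(std::size_t n, std::size_t k, std::mt19937& gen)
{
    std::list<int> pool(n);
    std::iota(pool.begin(), pool.end(), 0);  // indexes 0..n-1

    std::vector<int> picked;
    picked.reserve(k);
    for (std::size_t remaining = n; remaining > n - k; --remaining) {
        std::uniform_int_distribution<std::size_t> dis(0, remaining - 1);
        auto it = std::next(pool.begin(),
                            static_cast<std::ptrdiff_t>(dis(gen)));
        picked.push_back(*it);
        pool.erase(it);  // "short-circuit": this index can no longer be drawn
    }
    return picked;
}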

Gianluca Ghettini