-1

I have a C++ map: std::map <std::string, int>

I would like to pick p percentage of random elements from this map. Here p is dynamic. For example, 10% or 30% of all Key:Value pairs from this map but picked randomly. Cannot use c++11.

What is the best way to do this?

Thank you.

Galik
  • 47,303
  • 4
  • 80
  • 117
user629034
  • 659
  • 2
  • 11
  • 30

2 Answers2

2
  • Initialize a vector of bools to be the same size as the map
  • compute T = map.size() * percentage
  • Initiale the first T elements of the vector as "true", the rest false
  • randomly shuffle elements in the vector
  • iterator over the map and vector together - designated an item in the map when the corresponding index position in vector is true

Sample code:

#include <iostream>
#include <map>
#include <vector>
#include <string>

using namespace std;

void getRandomMapElements(map<string, int>& items, double percentage)
{
    const size_t count = items.size();
    vector<bool> vec;
    vec.resize(count); // all items in vec are "false"

    if (percentage < 0)
    {
        percentage = 0;
    }
    else if (percentage > 1.0)
    {
        percentage = 1.0;
    }

    size_t target = (size_t)(count * percentage); // actual number of items extracted

    // fill up the first TARGET count elements of the vector with true, the rest are kept at false
    for (size_t i = 0; i < target; i++)
    {
        vec[i] = true;
    }

    // shuffle the boolean vector
    for (size_t i = 0; i < count; i++)
    {
        bool val = vec[i];
        size_t swap = rand() % count;
        vec[i] = vec[swap];
        vec[swap] = val;
    }

    // iterate over the vector and map together
    map<string, int>::iterator itor = items.begin();
    for (size_t i = 0; i < count; i++)
    {
        if (vec[i])
        {
            cout << itor->first << " : " << itor->second << endl;
        }
        itor++;
    }
}
selbie
  • 100,020
  • 15
  • 103
  • 173
1

With C++17 std::sample does exactly what you need, but for c++98 it is slightly more complicated.

The shortest code that is compatible with c++98 is:

unsigned pick_below(unsigned n)
{
     // poor distribution:
     return std::rand() % n;
}
std::vector<std::pair<std::string, int> >
sample(const std::map<std::string, int> & data_in,
       unsigned p)
{
 std::vector<std::pair<std::string, int> > shuffled(data_in.begin(), data_in.end());
 for (unsigned i=shuffled.size()  ; i > 1 ; --i)
   std::swap(shuffled[i-1], shuffled[pick_below(i)]);
 shuffled.erase(shuffled.begin() +p, shuffled.end());
}

This code is problematic on two levels:

  1. std::random quality is not guaranteed.
  2. using modulo in pick_below distorts the distribution.

To overcome issue number 2, either use boost::random::uniform_int_distribution or rewrite the pick_below function according to this:

unsigned pick_below(unsigned n)
{
    unsigned x;
    do {
       x = rand();
    } while (x >= (RAND_MAX - RAND_MAX % n));
    return x % n;
}

Fixing issue 1 can be overcome by using a third-party random generator such as boost::random::mt19937.

Unfortunately, the complexity of this solution is O(n) on average (since pick_below is not guaranteed to terminate, but on any value p < RAND_MAX / 2 the probability to iterate it more than K times diminishes exponentially to less than 0.5K. The complexity can't be better than O(n) since there is no way to pick a kth element of a map, short of iterating all of it.

Michael Veksler
  • 8,217
  • 1
  • 20
  • 33