3

Suppose I am given a specific range (0 to 5,000,000) and I need to generate 2,500,000 unique random numbers from it. What is an efficient way to do this? I understand that it is tough to get true random numbers.

I tried checking whether each number already exists so that I can generate a new random number when it does, but that takes hours to compute. Is there a better way to do this?

The reason behind this is that I have a vector of size 5,000,000, and I want to shrink it to exactly half its size, i.e. delete a random 50% of its elements.

    #include <iostream>
    #include <vector>
    #include <stdlib.h>
    #include <algorithm>
    using namespace std;

    #define NUMBER 2500000
    #define RAND_START 0
    #define RAND_END 5000000

    unsigned int generate_random_number(int min, int max)
    {
        return min + (rand() % (unsigned int)(max - min + 1));
    }

    int main(int argc, char* argv[])
    {
        unsigned int count = 0, random_number;
        vector<unsigned int> rand_vector;
        do 
        {   
            count++;
            random_number = generate_random_number(RAND_START,RAND_END);
            // Tried to manually add a different number each time, but still no considerable improvement in performance.
            if (std::find(rand_vector.begin(), rand_vector.end(), random_number) != rand_vector.end())
            {
                if(random_number > count)
                    random_number = random_number - count;
                else
                    random_number = random_number + count;          
            }
            rand_vector.push_back(random_number);
            sort(rand_vector.begin(), rand_vector.end());
            rand_vector.erase(unique (rand_vector.begin(), rand_vector.end()), rand_vector.end());
        }while (rand_vector.size() != NUMBER);


        for (unsigned int i =0; i < rand_vector.size(); i++)
        {
            cout<<rand_vector.at(i)<<", ";
        }
        cout<<endl;
        return 0;
    }

Is there any better approach by which I can do this?

SyncMaster

4 Answers

5

You seem to be locked on the idea that you have to pre-generate your random numbers somehow. Why? You said that the ultimate task is to delete some random elements from a vector. For that specific problem it is not necessary to pre-generate all random indices in advance. You can simply generate these indices "on the fly".

For this specific task (i.e. deleting 50% of the elements in the vector), Knuth's algorithm will work pretty well (see https://stackoverflow.com/a/1608585/187690).

Just iterate through all elements of the original vector from 0 to N-1 and make a random decision to delete the i-th element with probability N_to_delete / N_to_iterate, where N_to_delete is the number of elements that still have to be deleted and N_to_iterate is the length of the remaining portion of the vector. This approach does it in one pass (if implemented smartly), requires no extra memory, and needs no trial-and-error iterations. It does exactly what you want: it destroys 50% of the vector elements with equal probability.
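
A minimal sketch of that one-pass deletion, written as a `remove_if`-style compaction (`rand()` is kept from the question purely for brevity; its limited range and modulo bias matter for a range of 5,000,000, so a better generator is advisable):

    #include <cstdlib>
    #include <vector>

    // Selection sampling (Knuth): walk the vector once and delete each element
    // with probability to_delete / remaining. Survivors are compacted to the
    // front, then the tail is dropped, so the whole job is a single pass.
    void delete_random_half(std::vector<unsigned int>& v)
    {
        std::size_t to_delete = v.size() / 2;   // elements still to be deleted
        std::size_t write = 0;                  // next slot for a survivor

        for (std::size_t read = 0; read < v.size(); ++read)
        {
            std::size_t remaining = v.size() - read;   // elements not yet decided
            if (to_delete > 0 &&
                static_cast<std::size_t>(rand()) % remaining < to_delete)
            {
                --to_delete;            // this element is deleted (skipped)
            }
            else
            {
                v[write++] = v[read];   // this element survives
            }
        }
        v.resize(write);                // chop off the leftover tail
    }

By the time the loop finishes, the quota is exhausted exactly (once to_delete equals remaining, the deletion probability becomes 1), so the result always holds exactly half of the original elements.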

Knuth's algorithm works best in situations where the number of random values (M) is fairly large compared to the length of the range (N), since its complexity is tied to N. In your case, where M is 50% of N, using Knuth's algorithm is a pretty good idea.

When the number of random values is much smaller than the range (M << N), Bob Floyd's algorithm (see the above link) makes more sense, since its complexity is determined by M rather than N. It requires additional memory (a set), but still makes no trial-and-error iterations when generating random numbers.
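
For reference, a sketch of Bob Floyd's sampling as described in the linked answer, assuming M distinct values are wanted from [0, N) (`rand()` again shown only for brevity):

    #include <cstdlib>
    #include <unordered_set>

    // Bob Floyd: exactly M random draws yield M distinct values in [0, N),
    // with no retries. If the drawn value t is already in the set, the current
    // upper bound j is inserted instead; j cannot be in the set yet.
    std::unordered_set<unsigned int> floyd_sample(unsigned int M, unsigned int N)
    {
        std::unordered_set<unsigned int> s;
        for (unsigned int j = N - M; j < N; ++j)
        {
            unsigned int t = static_cast<unsigned int>(rand()) % (j + 1);  // t in [0, j]
            if (!s.insert(t).second)
                s.insert(j);
        }
        return s;
    }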

However, in your case you are trying to delete elements from a vector, and vector element deletion is dominated by N, which defeats the benefit of Bob Floyd's algorithm anyway.

AnT stands with Russia
  • But I'd go from the end of the array — less data to *move* when deleting ;) – Michael Krelin - hacker Aug 23 '12 at 18:33
  • @Michael Krelin - hacker: Well, if instead of using `vector::erase` you do it *smartly*, by manual copying of surviving vector elements to the left, the implementation will work efficiently (i.e. in one pass) regardless of the direction. Each element will be copied only once. Actually, this is how `std::remove_if` works. – AnT stands with Russia Aug 23 '12 at 18:36
2

Instead of doing a manual check for unique numbers, you could use e.g. std::unordered_set and keep generating numbers until the size of the set equals the count of numbers you want.
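
A minimal sketch of that idea (the C++11 `<random>` facilities are used here instead of `rand()`, since `rand()`'s range may not even cover 0..5,000,000 on some platforms):

    #include <random>
    #include <unordered_set>

    int main()
    {
        // Keep drawing until the set holds 2,500,000 distinct values in
        // [0, 5000000]. Duplicates are rejected by the set itself, and the
        // membership test is O(1) on average, unlike the linear std::find
        // over a vector in the question.
        std::mt19937 gen(std::random_device{}());
        std::uniform_int_distribution<unsigned int> dist(0, 5000000);

        std::unordered_set<unsigned int> picked;
        while (picked.size() < 2500000u)
            picked.insert(dist(gen));
    }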

Some programmer dude
  • Aside from needing 2.5 million nodes to manage the numbers, this doesn't avoid checking for unique numbers; it just pushes responsibility for checking into the `unordered_set` object. Granted, `unordered_set` will do this more efficiently than a `vector`, but generating 2.5 million unique values still takes a lot of time. – Pete Becker Aug 23 '12 at 19:44
  • @Pete: I doubt it's all that bad -- at worst only half the indexes have been selected and hence need to be "re-rolled", which means you'll only need to do about twice as many selections as the number of values you need. But I haven't tested it. – Steve Jessop Aug 23 '12 at 23:56
  • @SteveJessop - it's not the duplicates, it's the memory allocation, searching, and rebalancing. – Pete Becker Aug 24 '12 at 12:08
  • @Pete: OK, but `unordered_set` is a hashset, so it shouldn't be outrageously slow. I guess it depends what you mean by "a lot of time". A lot of milliseconds, sure :-) – Steve Jessop Aug 24 '12 at 12:10
  • @SteveJessop - you're right. I was thinking of a `set`. There's still the memory overhead, one node for each entry. But searching and inserting are much faster with an `unordered_set`. – Pete Becker Aug 24 '12 at 12:16
2

Easiest way to code it:

    std::random_shuffle(vectoshrink.begin(), vectoshrink.end());
    vectoshrink.resize(vectoshrink.size() / 2);

If you want to maintain the order of the elements in vectoshrink use AndreyT's answer.

If you really do want to select the indexes in advance:

    std::vector<size_t> vec(vectoshrink.size());
    // std::iota is C++11 (from <numeric>), but easy to do yourself
    std::iota(vec.begin(), vec.end(), size_t(0));
    std::random_shuffle(vec.begin(), vec.end());
    vec.resize(vec.size() / 2);
    // optionally, restore ascending index order
    std::sort(vec.begin(), vec.end());

Now, you could use those indexes to shrink your original vector by copying the elements at the indexes in vec into a new vector and then swapping the result with the original.
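
A sketch of that last step, assuming `vectoshrink` holds `unsigned int` and `vec` holds the surviving indices as above:

    std::vector<unsigned int> kept;
    kept.reserve(vec.size());
    for (size_t i = 0; i < vec.size(); ++i)
        kept.push_back(vectoshrink[vec[i]]);   // copy only the selected elements
    vectoshrink.swap(kept);                    // replace the original contents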

In both cases, random_shuffle does more than is strictly required, since it shuffles the whole vector, whereas we actually only need to "shuffle" half of it. If you read how a Fisher-Yates shuffle works, it's easy to see that if you code it yourself, the only modification required is to do half as many steps as the full shuffle. C++ doesn't have a standard partial_random_shuffle, though.
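
A sketch of such a partial shuffle, i.e. a Fisher-Yates pass stopped after half as many swaps (`rand()` shown for brevity; prefer a generator whose range actually covers the vector size):

    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Do only size()/2 Fisher-Yates steps: afterwards the first half of v is a
    // uniformly random selection of its elements, so resizing keeps exactly it.
    void keep_random_half(std::vector<unsigned int>& v)
    {
        std::size_t half = v.size() / 2;
        for (std::size_t i = 0; i < half; ++i)
        {
            std::size_t j = i + static_cast<std::size_t>(rand()) % (v.size() - i);  // j in [i, size())
            std::swap(v[i], v[j]);
        }
        v.resize(half);
    }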

Finally, beware that the default random source might not be very good, so you might want to use the three-argument version of random_shuffle. Your generate_random_number function is quite biased for certain values of min and max, so you might want to research a bit more on the general theory of random number generation.
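
For example, a sketch of the three-argument form driven by a stronger generator (the C++11 `<random>` engine here is just one possible choice):

    #include <algorithm>
    #include <random>

    // The third argument must be callable as r(n) and return a random index in
    // [0, n). A Mersenne Twister drives it instead of rand(), which avoids the
    // bias of "rand() % n" for large n.
    std::mt19937 gen(std::random_device{}());
    auto pick = [&gen](std::ptrdiff_t n) {
        return std::uniform_int_distribution<std::ptrdiff_t>(0, n - 1)(gen);
    };
    std::random_shuffle(vectoshrink.begin(), vectoshrink.end(), pick);
    vectoshrink.resize(vectoshrink.size() / 2);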

Steve Jessop
0

Generate the 1st number < 5M, the 2nd number < (5M-1), etc. Each time after you remove an element, you'll have one element less, and you don't care if it's the same number. ;-) This doesn't answer your question about unique numbers, but it does address halving your vector.

And you will not have to generate more numbers than you need.
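
A sketch of that idea, reading it as "draw an index within the range that is still left and remove that element" (swapping with the last element keeps each removal O(1), though it does not preserve order; `rand()` used for brevity):

    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Each draw is taken from the part of the vector that still exists, so the
    // first index is < 5,000,000, the next < 4,999,999, and so on: no draw is
    // ever wasted on an element that was already removed.
    void remove_random(std::vector<unsigned int>& v, std::size_t how_many)
    {
        for (std::size_t n = 0; n < how_many; ++n)
        {
            std::size_t i = static_cast<std::size_t>(rand()) % v.size();
            std::swap(v[i], v.back());   // move the victim to the end ...
            v.pop_back();                // ... and drop it
        }
    }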

Michael Krelin - hacker
  • Well, these numbers won't be too random: if you saw a number > 5M-1, you can be sure that the other numbers will be < 5M-1 – Vlad Aug 23 '12 at 17:45
  • Will they be much less random? – Michael Krelin - hacker Aug 23 '12 at 17:45
  • True random numbers have a chance of being all equal to 5M-1 -- which will never be by your algo. – Vlad Aug 23 '12 at 17:47
  • And you can be sure that you don't have more elements either. You know, the 2nd number would have a smaller range with the uniqueness constraint anyway. – Michael Krelin - hacker Aug 23 '12 at 17:47
  • Oh, I see. But anyway, the average will be biased towards 0 in your case. – Vlad Aug 23 '12 at 17:49
  • @Vlad, yes and no. The average of the element index *in the original array* won't shift. – Michael Krelin - hacker Aug 23 '12 at 17:50
  • Well, let's consider a simpler case of RAND_END = 2, NUMBER = 1. The generated number is 0 or 1, gets changed into 1 or 2 by adding `count`, so it is outside the range :) (It's not connected to bias, but nevertheless...) – Vlad Aug 23 '12 at 17:58
  • For the less simple case RAND_END = 4, NUMBER = 2: first number after modification by `count` will be equally distributed on [1, 4]. The second generated number will be equally distributed on [0, 2], count = 2; after modification it'll be equally distributed on [2, 4]. Result: {3, 4}'s probability is twice as the probability of {1, 2}, because 1 can be generated only at the 1st step. – Vlad Aug 23 '12 at 18:03
  • Yes: I am comparing the probability of getting {1, 2} with probability of getting {3, 4}. For random set, they should be equal. – Vlad Aug 23 '12 at 22:36
  • Isn't probability of 1st and 2nd elements of the *original* array to get removed the same as probability of the 3rd and 4th? Of the *original* array? – Michael Krelin - hacker Aug 23 '12 at 22:43