
I need to generate random strings efficiently. My first attempt is below. I compiled the code with gcc at the -O3 optimization level; it takes 18.5 seconds to generate 10^7 random strings of length 64:

#include <iostream>
#include <random>
#include <algorithm>
#include <string>

std::string chars {"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()`~-_=+[{]{|;:'\",<.>/?"};
std::random_device rd;
std::mt19937 generator(rd());
  
std::string rand_str (int length) {
  std::string output (chars);
  std::shuffle(output.begin(), output.end(), generator);
  return output.substr(0, length);
}

int main() {
  std::string str;
  for (long i=0; i<10000000; ++i)
      str = rand_str (64);
}

I also checked std::sample from C++17; it is not faster than the method above. In addition, it keeps the selected characters in their original relative order, so the result is not really random.
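
For reference, here is a minimal sketch of how std::sample can be used here (rand_str_sample is an illustrative name, and this is a sketch rather than the exact code that was timed); it reuses the chars and generator globals above and needs <iterator> for std::back_inserter:

#include <iterator>   // std::back_inserter

// Sketch only: picks `length` distinct characters from chars and keeps their
// original relative order, which is exactly why the result is not very random.
std::string rand_str_sample (int length) {
  std::string output;
  std::sample(chars.begin(), chars.end(), std::back_inserter(output),
              length, generator);
  return output;
}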

Edit: std::shuffle is not a good choice, since it cannot produce duplicate characters. Based on the comments, I modified the code. This version takes more than 9 minutes for 10^7 random strings.

std::string rand_str (size_t length) {
  const size_t char_size = chars.size();
  std::uniform_int_distribution<> random_int (0, char_size - 1);
  std::string output;
  for (size_t i=0; i<length; ++i)
    output.push_back(chars[random_int(generator)]);
  return output;
}

Question

  • Are there more efficient ways to do this in modern C++?

I appreciate any suggestions to improve the code.

Ali
  • First of all, you don't say anything about the requirements of the random string. Based on your code, the requirement is that each random string must not contain duplicate chars. – t.niese Aug 11 '20 at 12:24
  • `std::mt19937` has relatively good PRNG "quality", but it isn't super fast. There are faster PRNGs, such as [Xorshift](https://en.wikipedia.org/wiki/Xorshift). – Daniel Langr Aug 11 '20 at 12:32
  • Two more observations: 1) Your code has no observable effect; it may therefore be completely optimized out by the compiler to `return 0;` in `main`. 2) There are allocations in each iteration, which are unnecessary. – Daniel Langr Aug 11 '20 at 12:35
  • That's a lot of memory allocations and deallocations, and less than one microsecond per string (you create two for each result) is not terribly bad. – molbdnilo Aug 11 '20 at 13:17
  • @t.niese You are right. I updated the question. Actually, duplicates are allowed. I would like to have a homogeneous random string with high entropy. – Ali Aug 11 '20 at 13:20
  • If you don't need to prevent duplicates, then a random number generator from 0 to the size of `chars`, and then using the numbers from that generator for a `chars[random_number]` access, could be faster. – t.niese Aug 11 '20 at 13:22
  • Note: https://stackoverflow.com/questions/25298585/efficiently-generating-random-bytes-of-data-in-c11-14 exists. So what's the problem? Generating random bits? or transforming random bits into strings with an alphabet of 94 chars and length 64? – Wyck Aug 11 '20 at 14:51
  • Generating random numbers is a lot slower than you might expect. It can easily lead to being the gating factor. – Mark Ransom Aug 11 '20 at 15:21
  • @DanielLangr Thanks for the feedback. Which allocations can be avoided in the code? I think I have to create std::string for the function return. Right? – Ali Aug 11 '20 at 15:38
  • @molbdnilo Do you mean I can avoid an unnecessary allocation? – Ali Aug 11 '20 at 15:39
  • See https://stackoverflow.com/a/20818831/5987 for generating lots of small random numbers with fewer calls to the random number generator. – Mark Ransom Aug 11 '20 at 15:50
  • @ali That very much depends on what you want to do with the resulting strings in the iterations, which you don't specify. You can, for example, return a `std::string_view` object or a reference to a `static` local string to avoid allocations. – Daniel Langr Aug 12 '20 at 08:44
  • In your edit, `random_int(rd)` should likely be `random_int(generator)` instead. `std::random_device` is very slow. – Daniel Langr Aug 12 '20 at 13:44
  • I was able to speed up your generation 1.4 times by avoiding allocations and using a faster prng (xorshift64): [live benchmark](https://quick-bench.com/q/dPpa3A6s4OooNnZ_82utgmU4BGI). An additional significant speedup was also possible, but at the price of 1) using modulo instead of a uniform distribution or 2) restricting the number of considered characters to 64: [live benchmark](https://quick-bench.com/q/4jNytQL0MKTuy1HBl43ABh7-Ny8). – Daniel Langr Aug 12 '20 at 14:16 (a rough sketch along these lines appears after these comments)
  • @DanielLangr hello! could you please explain the xor part, why those numbers? x ^= x << 13; x ^= x >> 7; x ^= x << 17; – SimpleAsk Dec 16 '20 at 10:14
  • @SimpleAsk Learn about the theory of PRNGs and xorshift; Wikipedia has many links to corresponding material in the References section. – Daniel Langr Dec 16 '20 at 11:05
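
To make the suggestions from these comments concrete, the following is a rough sketch (not code from the question or the answer) that combines a xorshift64 engine, a caller-owned buffer reused across iterations, and plain modulo indexing. The name fill_rand_str and the fixed seed constant are made up for illustration; in real use the state would be seeded from std::random_device, and the modulo introduces a bias on the order of alphabet.size()/2^64, i.e. negligible.

#include <cstddef>
#include <cstdint>
#include <string>

// Sketch only: xorshift64 state; any nonzero seed works (seed it from
// std::random_device in real code).
static std::uint64_t state = 0x9E3779B97F4A7C15ull;

inline std::uint64_t xorshift64() {
  state ^= state << 13;   // shift constants from Marsaglia's xorshift paper
  state ^= state >> 7;
  state ^= state << 17;
  return state;
}

// Fill a caller-owned buffer, so nothing is reallocated after the first call.
void fill_rand_str(std::string& out, std::size_t length, const std::string& alphabet) {
  out.resize(length);
  for (std::size_t i = 0; i < length; ++i)
    out[i] = alphabet[xorshift64() % alphabet.size()];
}

Restricting the alphabet to 64 characters, as in the second linked benchmark, would let the modulo be replaced with a 6-bit mask (value & 63) and would allow ten characters to be peeled off each 64-bit draw.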

1 Answer

#include <iostream>
#include <random>
#include <algorithm>
#include <chrono>
#include <string>

std::string chars {"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()`~-_=+[{]{|;:'\",<.>/?"};
std::random_device rd;
std::mt19937 generator(rd());
  
std::string rand_str(int length) {
  std::string output;
  output.reserve(length);                // one allocation per string

  while (length > 0)
  {
      auto randNumb = generator();       // one 32-bit random value...
      while (randNumb > 93 && length--)  // ...split into base-93 digits
      {
        output.push_back(chars[randNumb % 93]);
        randNumb /= 93;
      }
  }
  return output;
}

int main() {
  auto startTP = std::chrono::system_clock::now();
  std::string rand_bytes;
  for (long i=0; i<10000000; ++i)
      rand_bytes = std::move(rand_str(64));
  auto endTP = std::chrono::system_clock::now();

  std::cout << "This took: " << std::chrono::duration_cast<std::chrono::microseconds>(endTP-startTP).count() << std::endl;
}

This takes around 3 seconds on my machine. The trick is to call the random number generator as rarely as possible and to allocate each string's buffer only once, up front, with reserve.

What I'm doing is converting randNumb into base 93 (the size of chars) and then using each base-93 digit as a separate random index into chars. Since 2^32 lies between 93^4 and 93^5, and the inner loop stops once the remaining value drops to 93 or below, this yields about four characters per call to the generator.

A.Hristov
  • @t.niese this is incorrect. There are chars.size()^length possible strings, and the strings are only as predictable as the generator. – A.Hristov Aug 11 '20 at 13:49
  • This just breaks one `randNumb` into base-93 digits. It should be perfectly fine, provided `generator.min() == 0` and (log of `generator.max()` in base 93) >= `length`. – lvella Aug 11 '20 at 14:02
  • The way you've written that, I'd expect a bias against `a`. – rici Aug 11 '20 at 14:21
  • This is badly biased (not just against `a` but in favor of `bcd`). – Ben Voigt Aug 11 '20 at 14:47
  • That's true. I didn't really think about it. It could be avoided if, in the condition of the while loop, you put randNumb > 93. Then there should be no bias (or negligible bias, since technically UINT_MAX % 93 = 3, but there is about a 1/1e9 chance that any of these 3 numbers will be chosen, which in the grand scheme of things doesn't matter). – A.Hristov Aug 11 '20 at 15:04
  • @A.Hristov: You're still underestimating the bias, which is easily observable in your corrected version, amounting to a difference of about 0.45% in selection frequency. (It's a lot better than the original version, in which it was closer to 40%.) Your analysis above only considers the last base-93 digit of the four extracted from a random number. But the bias in the first digit is much greater: almost 2%. Also, if the first digit happens to be 0 ('a') and whatever is left over happens to be 0 (with probability of about 2%), then the `a` will be discarded. So [b-L] are the most common. – rici Aug 12 '20 at 17:07
  • @rici The problem with `a` could be avoided if I'd written >= 93 (didn't really put much thought into it) but I don't understand why there is still so much bias and why such an arbitrary selection [b-L] – A.Hristov Aug 12 '20 at 17:23
  • @A.Hristov: After you divide by 93 three times, there are 5340 possible residues (one of which has a lower probability than the others, but that's a relatively minor detail.) Those 5340 residues correspond to 57 complete ranges of 0-92, plus one incomplete range of 0-38. So values 0-38 are produced 58 times to every 57 times a value of 39-92 is produced. (It's actually 57.6 because of the incomplete range, making value 38 intermediate.) – rici Aug 12 '20 at 17:29 (a sketch of a bias-free variant appears after these comments)
  • @rici Nice explanation. I get it now – A.Hristov Aug 12 '20 at 17:33
  • This is definitely broken because my memory is 100 MB+ and it says debug assertion failed or something in Visual Studio 2022 – Felierix Jan 27 '22 at 10:02
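
For completeness, here is a hedged sketch of one way to remove the bias rici describes: draw one 32-bit value, reject it if it falls outside the largest multiple of 93^4 that fits in 32 bits (roughly 0.7% of draws), and then extract exactly four uniform base-93 digits from it. The name rand_str_unbiased is made up; the sketch reuses the chars and generator globals from the answer.

#include <cstdint>

// Sketch only: four unbiased base-93 digits per accepted 32-bit draw.
std::string rand_str_unbiased(int length) {
  constexpr std::uint64_t kBase  = 93;                                // chars.size()
  constexpr std::uint64_t kBase4 = kBase * kBase * kBase * kBase;     // 93^4
  constexpr std::uint64_t kLimit = (1ull << 32) / kBase4 * kBase4;    // largest multiple of 93^4 <= 2^32

  std::string output;
  output.reserve(length);
  while (length > 0) {
    std::uint64_t r = generator();   // uniform in [0, 2^32)
    if (r >= kLimit)                 // reject the incomplete top range (~0.7% of draws)
      continue;
    r %= kBase4;                     // now uniform in [0, 93^4)
    for (int d = 0; d < 4 && length > 0; ++d, --length) {
      output.push_back(chars[r % kBase]);
      r /= kBase;
    }
  }
  return output;
}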