33

I'm trying to run several instances of a piece of code (2000 instances or so) concurrently in a computing cluster. The way it works is that I submit the jobs and the cluster will run them as nodes open up every so often, with several jobs per node. This seems to produce the same values for a good number of the instances in their random number generation, which uses a time-seed.

Is there a simple alternative I can use instead? Reproducibility and security are not important, quick generation of unique seeds is. What would be the simplest approach to this, and if possible a cross platform approach would be good.

Yu Hao
  • 119,891
  • 44
  • 235
  • 294
CHP
  • 895
  • 5
  • 13
  • 28
  • 1
    I'm unclear on the situation.Are you re-seeding – Brian Kennedy Oct 01 '11 at 01:42
  • What algorithm is being used for the pseudo-random number generation? i.e. lcg, Marsaglia, Mersenne Twister, etc... – Jason Strimpel Oct 01 '11 at 01:43
  • Re-seeding? There is one call to srand() when the compiled code runs. But over 100 concurrent instances of the compiled code run at the same time. Therefore, some of those 100 produce identical random numbers, since their srand must run at the same time. – CHP Oct 01 '11 at 01:44
  • you should use a gettimeofday call, and seed with the tv.tv_usec value. – Kevin Oct 01 '11 at 01:46
  • 1
    Uh, ignore that... it timed out or something... here's my real response: You could try GUIDs, mash them, and use that to seed. However, if that cluster is one machine, you may have a lot of common bits. What I might do is have them each request a seed from a single seed providing process... have it seed a random number based on the time. And then it just sits there waiting for seed requests, to which it responds with the next random number in its sequence. That should give you a nice and even distribution of seeds for your 2000 processes doing the real work. – Brian Kennedy Oct 01 '11 at 01:49
  • See if you can pregenerate the seed and use them as input arguments to your program. It'll be easier to generate 2000 unique seeds before you start the 2000 jobs. – nos Oct 01 '11 at 01:49
  • Brian, that would be extremely complicated to set up given the way the cluster is set up. Simple in a regular environment which you control, but not in this one where I can't run anything on it without very specific resource requests that get queued up and only run for the specific requested time, etc. Pre-generating seeds is more feasible, but requires an additional script or code to run which is kind of... inelegant, though I might end up doing it. The rdtsc() answer below seems interesting, I'll probably give that a shot first. – CHP Oct 01 '11 at 02:02
  • 1
    Use time, but add in something related to the local platform, such as the IP address or your current thread's task ID or some such. – Hot Licks Oct 01 '11 at 03:43
  • The right answer is the one that mentions C++11's `std::random_device` below to seed a random number generator. – EmeryBerger Aug 17 '17 at 19:59
  • That is not the right answer since this question is specifically tagged C and not C++, though that's good for people to know if they are using C++11 – CHP Aug 17 '17 at 21:52

10 Answers10

30

The rdtsc instruction is a pretty reliable (and random) seed.

In Windows it's accessible via the __rdtsc() intrinsic.

In GNU C, it's accessible via:

unsigned long long rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

The instruction measures the total pseudo-cycles since the processor was powered on. Given the high frequency of today's machines, it's extremely unlikely that two processors will return the same value even if they booted at the same time and are clocked at the same speed.

Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • I like this answer but can't try it out just yet. I'll give it a shot when the cluster login nodes come back up again and see how it works. Does this work per core? There's two processors per node, with 4 cores on each processor. Each job requests one core if I run my code in single-threaded mode, and up to 8 cores if I run in multi-threaded mode. – CHP Oct 01 '11 at 02:25
  • 2
    You can call this concurrently on the same processor and even the same core. Technically speaking, cores on the same processor will be somewhat synchronized, but that's still not going to be a problem because normal jitter will make the probability of two threads calling it on the same cycle nearly zero. (and even if they ARE called on the same cycle, there's plenty of reason to believe the processor will block one until the other is done - so it may even prevent them from ever returning the same value to two threads) – Mysticial Oct 01 '11 at 02:29
  • Finally got to try this, and it seems to have done the trick, 2000 simulations and not a single identical one :) I'll have to test this a few more times for assurance, but it seems to work. – CHP Oct 03 '11 at 16:52
  • Hey Mystical, Problem with me pressing the wrong darn arrow and only noticing it after 5 minutes! edit your post and I'll change that into a +1! – brice Apr 22 '12 at 17:55
5
unsigned seed;

read(open("/dev/urandom", O_RDONLY), &seed, sizeof seed);
srand(seed); // IRL, check for errors, close the fd, etc...

I would also recommend a better random number generator.

DigitalRoss
  • 143,651
  • 25
  • 248
  • 329
5

If C++11 can be used then consider std::random_device. I would suggest you to watch link for a comprehensive guide.

Extracting the essential message from the video link : You should never use srand & rand, but instead use std::random_device and std::mt19937 -- for most cases, the following would be what you want:

#include <iostream>
#include <random>
int main() {
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_int_distribution<int> dist(0,99);
    for (int i = 0; i < 16; i++) {
        std::cout << dist(mt) << " ";
    }
    std::cout << std::endl;
}
5

I assume you have some process launching the other processes. Have it pass in the seed to use. Then you can have that master process just pass in a random number for each process to use as its seed. That way there's really only one arbitrary seed chosen... you can use time for that.

If you don't have a master process launching the others, then if each process at least has a unique index, then what you can do is have one process generate a series of random numbers in memory (if shared memory) or in a file (if shared disk) and then have each process pull the index'th random number out to use as their seed.

Nothing will give you a more even distribution of seeds than a series of random numbers from a single seed.

Brian Kennedy
  • 3,499
  • 3
  • 21
  • 27
5

A combination of the PID and the time should be enough to get a unique seed. It's not 100% cross-platform, but getpid(3) on *nix platforms and GetProcessId on Windows will get you 99.9% of the way there. Something like this should work:

srand((time(NULL) & 0xFFFF) | (getpid() << 16));

You could also read data from /dev/urandom on *nix systems, but there's no equivalent to that on Windows.

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
  • 1
    I'm not a Windows programmer, but I think [CryptGenRandom](http://msdn.microsoft.com/en-us/library/windows/desktop/aa379942%28v=vs.85%29.aspx) is the approximate Windows equivalent of /dev/(u)random on *nix. – Ilmari Karonen Oct 01 '11 at 06:57
  • Quickly `#include `; just in case it was not possible to follow the link you provide [getpid(3)](http://linux.die.net/man/3/getpid). – Hastur Nov 17 '15 at 10:38
1

Windows

Provides CryptGenRandom() and RtlGenRandom(). They will give you an array of random bytes, which you can use as seeds.

You can find the docs on the msdn pages.

Linux / Unixes

You can use Openssl's RAND_bytes() to get a random number of bytes on linux. It will use /dev/random by default.

Putting it together:

#ifdef _WIN32
  #include <NTSecAPI.h>
#else
  #include <openssl/rand.h> 
#endif

uint32_t get_seed(void)
{
  uint32_t seed = 0;

#ifdef _WIN32
  RtlGenRandom(&seed, sizeof(uint32_t) );
#else
  RAND_bytes(&seed, sizeof(uint32_t) ); 
#endif

  return seed;
}

Note that openssl provides a Cryptographically secure PRNG by default, so you could use it directly. More info here.

Community
  • 1
  • 1
brice
  • 24,329
  • 7
  • 79
  • 95
1

Instead of straight time as measured in seconds from the C std lib time() function, could you instead use the processor's counter? Most processors have a free running tick count, for example in x86/x64 there's the Time Stamp Counter:

The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium. It counts the number of ticks since reset.

(That page also has many ways to access this counter on different platforms -- gcc/ms visual c/etc)

Keep in mind that the timestamp counter is not without flaws, it may not be synced across processors (you probably don't care for your application). And power saving features may clock up or down the processor (again you probably don't care).

Community
  • 1
  • 1
Doug T.
  • 64,223
  • 27
  • 138
  • 202
1

Just an idea... generate a GUID (which is 16 bytes) and sum its 4-byte or 8-byte chunks (depending on the expected width of the seed), allowing integer wrap-around. Use the result as a seed.

GUIDs typically encapsulate characteristics of the computer that generated them (such as MAC address), which should make it rather improbable that two different machines will end-up generating the same random sequence.

This is obviously not portable, but finding appropriate APIs/libraries for your system should not be too hard (e.g. UuidCreate on Win32, uuid_generateon Linux).

Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
0

Assuming you're on a reasonably POSIX-ish system, you should have clock_gettime. This will give the current time in nanoseconds, which means for all practical purposes it's impossible to ever get the same value twice. (In theory bad implementations could have much lower resolution, e.g. just multiplying milliseconds by 1 million, but even half-decent systems like Linux give real nanosecond results.)

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
0

If uniqueness is important, you need to arrange for each node to know what IDs have been claimed by others. You could do this with a protocol asking "anyone claimed ID x?" or arranging in advance for each node to have a selection of IDs which have not been allocated to others.

(GUIDs use the machine's MAC, so would fall into the "arrange in advance" category.)

Without some form of agreement, you'll risk two nodes climing the same ID.

billpg
  • 3,195
  • 3
  • 30
  • 57