-2

I have a list of about 155k files. When I call random.sample(list, 100), the results are not identical to the previous sample, but they look similar.

Is there a better alternative to random.sample that returns a new list of 100 random files?

import glob
import random

# get_all_folders is defined elsewhere (not shown)
folders = get_all_folders('/data/gazette-txt-files')

# get all files from all folders
def get_all_files():
    files = []
    for folder in folders:
        # extend keeps the list flat, so no separate 2D-to-1D conversion is needed
        files.extend(glob.glob("/data/gazette-txt-files/" + folder + "/*.txt"))

    # 200 random text files
    return random.sample(files, 200)
Dimitris Fasarakis Hilliard
Zacharia Mwangi
  • 2
    The whole `random` library is pseudo-random. Short of half-life decay idk what else is "truly" random besides maybe network noise. – Ryan Haining Oct 07 '16 at 17:18
  • 6
    Usually, the main problem with randomness is the human perception of what is random is completely wrong. We keep seeing “unrandom” patterns in perfectly random signals. That's just how our brain works. – spectras Oct 07 '16 at 17:21
  • 1
    If you want more assured randomness, instantiate `SystemRandom()`. Still, you might just have to trust that Python's randomness is pretty good at this point, considering that if there were any problems with it, they would've been addressed long ago. – Random Davis Oct 07 '16 at 17:26
  • What do you mean by "look similar"? By the Birthday Paradox there is a nontrivial chance of hitting at least one file you hit on the previous sample, on the order of 1-exp(-(100)^2 / (2*155000)) = 3.2%. – Paul Oct 07 '16 at 17:37
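
The overlap between two samples raised in the last comment can also be computed directly rather than estimated. A minimal sketch, assuming two independent calls to random.sample on the same list; the function name `overlap_probability` is made up for illustration:

```python
def overlap_probability(population, k):
    """Chance that two independent size-k samples (each drawn without
    replacement from the same population) share at least one element."""
    p_disjoint = 1.0
    for i in range(k):
        # The (i+1)-th pick of the second sample must avoid all k
        # elements of the first sample.
        p_disjoint *= (population - k - i) / (population - i)
    return 1.0 - p_disjoint


print(overlap_probability(155_000, 100))
```

For 155,000 files and samples of 100 this comes out at a little over 6%, roughly twice the birthday-style estimate, because every file in the second sample is compared against all 100 files in the first. Either way, the expected number of shared files is well below one.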

2 Answers

3

For purposes like randomly selecting elements from a list, random.sample suffices. It does not provide true randomness, and I'm not sure whether that is even theoretically possible.

By default, random uses a pseudo-random number generator (PRNG) called Mersenne Twister (MT). Although it is suitable for applications such as simulations (and minor things like picking from a list of paths), it shouldn't be used where security is a concern, because it is deterministic.
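
To make "deterministic" concrete, here is a small sketch. The list `paths` is a made-up stand-in for the real file list, and the seed values are arbitrary:

```python
import random

# Hypothetical stand-in for the 155k real file paths.
paths = [f"file_{i}.txt" for i in range(155_000)]

# Two generators created with the same seed yield the same sample.
first = random.Random(1234).sample(paths, 100)
second = random.Random(1234).sample(paths, 100)
print(first == second)  # True

# A generator created with a different seed yields a different sample.
third = random.Random(5678).sample(paths, 100)
```

This is fine for picking files. It only matters when someone must not be able to predict the output.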

This is why Python 3.6 also introduces the secrets module with PEP 506, which uses SystemRandom (backed by os.urandom) by default and is capable of producing cryptographically secure pseudo-random numbers.
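
If you do want to sample from the OS-backed generator, something like the following should work. It is only a sketch, and `paths` is again a made-up stand-in for the real file list:

```python
import random

# Hypothetical stand-in for the 155k real file paths.
paths = [f"file_{i}.txt" for i in range(155_000)]

# SystemRandom draws from os.urandom and offers the same sample() method
# as the module-level functions. It cannot be seeded or reproduced.
sys_rng = random.SystemRandom()
picked = sys_rng.sample(paths, 100)
print(len(picked), len(set(picked)))
```

The same class is also exposed as secrets.SystemRandom on Python 3.6 and later. For choosing files it should not look any "more random" than random.sample does.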

Of course, the bottom line is that whether you use a PRNG or a CSPRNG to generate your numbers, they are still going to be pseudo-random.

Dimitris Fasarakis Hilliard
-1

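You may need to seed the generator. See random.seed in the documentation.

Just call random.seed() before you get the samples. Called with no argument, it seeds from an OS randomness source if one is available, and from the current system time otherwise.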
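
A minimal sketch of what that looks like, assuming a made-up stand-in list `paths` in place of the real files. Note that the random module already seeds itself this way when it is first imported, so the explicit call is usually not needed to get different samples:

```python
import random

# Hypothetical stand-in for the 155k real file paths.
paths = [f"file_{i}.txt" for i in range(155_000)]

random.seed()  # no argument: OS randomness if available, else system time
batch_a = random.sample(paths, 100)

random.seed()  # reseed before the next draw
batch_b = random.sample(paths, 100)

# Count how many files the two batches have in common.
shared = set(batch_a) & set(batch_b)
print(len(shared))
```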

Dimitris Fasarakis Hilliard
Tammo Heeren