8

I'm looking for a way to randomly select a file from a tree of directories in a manner such that any individual file has exactly the same probability of being chosen as all other files. For example in the following tree of files, each file should have a 25% chance of being chosen:

  • /some/parent/dir/
    • Foo.jpg
    • sub_dir/
      • Bar.jpg
      • Baz.jpg
      • another_sub/
        • qux.png

My interim solution which I'm using while I code the rest of the app is to have a function like so:

def random_file(dir):
    file = os.path.join(dir, random.choice(os.listdir(dir)));
    if os.path.isdir(file):
        return random_file(file)
    else:
        return file

However this obviously biases the results depending on where they are in the tree and how many siblings are along side them in their directory so they end up with the following probabilities of being selected:

  • /some/parent/dir/
    • Foo.jpg - 50%
    • sub_dir/ (50%)
      • Bar.jpg - 16.6%
      • Baz.jpg - 16.6%
      • another_sub/ (16.6%)
        • qux.png - 16.6%

The context for the function is in a background rotation app I'm writing, so the ability to filter out unwanted file extensions from being in the results would be a bonus (although I could simply force that by choosing again if it's not the file type I want... that gets messy if there's an abundance of files of the "wrong" type, though).

Graham Lyon
  • 226
  • 3
  • 11
  • 1
    Do you want all pictures to be in one "run" once? You could just iterate through all files, add them to a list and shuffle that list once. – Jacob Jun 20 '11 at 13:26

3 Answers3

14

You can only select all files with the same probability if you know the total number of files in advance, so you need to create a full list first:

files = [os.path.join(path, filename)
         for path, dirs, files in os.walk(dir)
         for filename in files
         if not filename.endswith(".bak")]
return random.choice(files)
Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • Looks good, thanks! In the end I went with a slight alteration to get a white-list of allowed file extensions rather than a single disallowed one, adding an "extensions" argument to the method and then changing the if to: `if os.path.splitext(file)[1] in extensions` – Graham Lyon Jun 20 '11 at 14:03
  • This overstates the case a bit. It is entirely possible to select from all files with equal likelihood without first knowing the total number of files. – Michael J. Barber Jun 20 '11 at 14:13
  • 1
    @Michael: Care to explain this a bit further? – Sven Marnach Jun 20 '11 at 14:19
  • @Sven Marnach - Sure. For `n` items, it is either equal selection among the first `n-1` items, or the `n`th item with probability `1/n`. This can be computed as you run through the list of possibilities, requiring more random numbers but no full list. It's equivalent to reservoir sampling of a single item. I'll post it as an answer with code. – Michael J. Barber Jun 20 '11 at 14:35
  • @Michael: This seems to be about the meaning of the word "first". Define "first" as "before ultimately choosing the file". In the last iteration of your algorithm, you will also need to know the total number of files. You just do a lot of work before the final choice, so it seems you don't need to know it "first". – Sven Marnach Jun 20 '11 at 14:51
  • @Sven That would be a rather peculiar usage of "first", I'd say. More than that, you clearly do not "need to create a full list" at all. At any rate, this is just about wording rather than the essentials of the solution. To be clear, I'd do exactly what you and Mr E have given as answers, unless there were clear reason to take another approach---I have never worked in a system so memory constrained that I'd be unable to build a list of file names! – Michael J. Barber Jun 20 '11 at 15:10
  • @Michael: I was repsonding to "It is entirely possible to select from all files with equal likelihood without first knowing the total number of files." In the context of this statement, the most natural meaning of "first" is "before selecting". To make your choice, you need to know the total number of files, and to know the total number of files, you have to list them all. Of course you don't need to store the list in memory (and I never said so). – Sven Marnach Jun 20 '11 at 15:25
5

As other answers mentioned, you can choose one at random either by collecting all file paths into a list and using random.choice. Alternatively, an online selection is possible that uses no extra memory by using more random numbers. For n items, it is either equal selection among the first n-1 items, or the nth item with probability 1/n. This can be computed as you run through the list of possibilities.

You can iterate through all file names with:

def recursive_files(dir):
    for path, _, fnames in os.walk(dir):
        for fname in fnames:
            yield os.path.join(path, fname)

And make the online selection with this:

import random
def online_choice(iterable):
    for n, x in enumerate(iterable, 1):
        if random.randrange(n) == 0:
            pick = x
    return pick
Michael J. Barber
  • 24,518
  • 9
  • 68
  • 88
  • 1
    This looks very interesting - I assume the win here for me is I can stop recursing through the directories as soon as I find one that "wins"? If so, why don't you just have `return x` inside the if? Surely the way you've got it, it just carries on iterating regardless of the fact that it's already picked? Unfortunately "online selection" is predictably bad as a google search :/ is there a good reference online that I can read up more on this? – Graham Lyon Jun 20 '11 at 15:02
  • @Graham Just the opposite, you have to go through them all. It is only better in terms of memory usage. Definitely use Sven's program for practical use. For more on this approach, I'd suggest looking for "reservoir sampling". More interesting is to view this approach as an application of conditional probabilities in a fashion similar to dynamic programming. Unfortunately, I don't have a reference for that more general view; I rather wish I could remember where I first read about it, but it was too many years ago. – Michael J. Barber Jun 20 '11 at 15:27
  • I certainly found this (and the pointer to "reservoir sampling") interesting, so +1. – Sven Marnach Jun 20 '11 at 15:33
  • @Sven Glad you found it interesting. Sometimes (not often!) a bit of pedantry can give a nice result. :) – Michael J. Barber Jun 20 '11 at 15:37
  • should pick be initialized to be next(iterable)? – Fu Jiantao Feb 09 '20 at 22:56
3

Why don't you just store all the files in one list (with their path) and choose from that?

YXD
  • 31,741
  • 15
  • 75
  • 115