You're correct that random.shuffle returns None. That's because it shuffles its list argument in place, and it's a Python convention that functions which take a mutable argument and mutate it return None. However, you've misunderstood the random arg to random.shuffle: it needs to be a random number generator (a function returning a random float in [0.0, 1.0)), not a function like your seed that always returns the same number.
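For example, here's a quick demonstration with a throwaway list:

import random

nums = [1, 2, 3, 4, 5]
result = random.shuffle(nums)  # shuffles nums in place
print(result)                  # None
print(nums)                    # e.g. [3, 1, 5, 2, 4]: the original list is reordered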
BTW, you can seed the standard random number generator provided by the random module using its seed function. random.seed accepts any hashable object as its argument, although it's customary to pass it a number or a string. You can also pass it None (which is equivalent to not passing an argument at all), and it will seed the randomiser from the system random source (if there is no system random source, the system time is used as the seed). If you don't explicitly call seed after importing the random module, that's equivalent to calling seed().
The benefit of supplying a seed is that each time you run the program with the same seed, the random numbers produced by the various random module functions will be exactly the same. This is very useful while developing and debugging your code: it can be hard to track down errors when the output keeps changing. :)
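For instance (the seed value here is arbitrary):

import random

random.seed(42)
print(random.random(), random.randint(1, 100))
random.seed(42)
print(random.random(), random.randint(1, 100))  # exactly the same pair as before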
There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to take 5000 random samples. That way you don't need to shuffle the whole list.
import random
random.seed(0.47231099848)
# teens, tweens, thirties are lists of file names
file_lists = [teens, tweens, thirties]
# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)
    data.append(flist[:5000])
Using sample
# Sample
data = []
for flist in file_lists:
    data.append(random.sample(flist, 5000))
I haven't performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than moving all of the list items. shuffle is fairly efficient, so you probably wouldn't notice much difference in the run time unless your teens, tweens, and thirties file lists each contain a lot more than 5000 file names.
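If you want to check that for yourself, here's a rough way to compare them with the standard timeit module (the list size and file names below are just made up for the test):

import random
import timeit

# Dummy file names, roughly the size of one of your lists.
flist = ['file_{}.txt'.format(i) for i in range(50000)]

shuffle_time = timeit.timeit(
    'lst = flist[:]; random.shuffle(lst); lst[:5000]',
    globals=globals(), number=100)
sample_time = timeit.timeit(
    'random.sample(flist, 5000)',
    globals=globals(), number=100)
print('shuffle+slice:', shuffle_time, 'sample:', sample_time)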
Both of those loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names, you just need to use the list.extend method instead of list.append. E.g.,
data = []
for flist in file_lists:
    data.extend(random.sample(flist, 5000))
Or we can do it using a list comprehension with a double for loop:
data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
If you need to filter the contents of data to build your final file list, the simplest way is to add an if condition to the list comprehension.
Let's say we have a function that can test whether a file name is one we want to keep:
def keep_file(fname):
    # Return True if we want to keep fname, otherwise return False.
    # For example (an arbitrary illustration), keep only .txt files:
    return fname.endswith('.txt')
Then we can do
data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]
and data will only contain the file names that pass the keep_file test.
Another way to do it is to create the file names using a generator expression instead of a list comprehension, and then pass that to the built-in filter function:
data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))
data_gen is itself an iterator. You can build a list from it like this:
data_final = list(data_gen)
Or if you don't actually need all the names as a collection and you can just process them one by one, you can put it in a for loop, like this:
for fname in data_gen:
    print(fname)
    # Do other stuff with fname
This uses less RAM, but the downside is that it "consumes" the file names, so once the for loop has finished, data_gen will be empty.
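You can see that behaviour with a tiny toy generator (the names here are just placeholders):

gen = (name for name in ['a.txt', 'b.txt', 'c.txt'])
for name in gen:
    print(name)     # prints a.txt, b.txt, c.txt
print(list(gen))    # [] because the generator has already been consumed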
Let's assume that you've written a function that extracts the desired data from each file:
def age_and_text(fname):
    # Do stuff that extracts the age and desired text from the file,
    # binding the results to age and text.
    return fname, age, text
You could create a list of those (filename, age, text) tuples like this:
data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))
final_data = [age_and_text(fname) for fname in data_gen]
Notice the slice in my first snippet: flist[:5000]. That takes the first 5000 items in flist, i.e. the items with indices 0 to 4999 inclusive. Your version had teens[:5001], which is an off-by-one error: it takes 5001 items. Slices work the same way as ranges, so range(5000) yields the 5000 numbers from 0 to 4999. It works this way because Python (like most modern programming languages) uses zero-based indexing.
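You can verify that at the interactive prompt with a small list standing in for one of your file lists:

nums = list(range(10))
print(nums[:5])        # [0, 1, 2, 3, 4], i.e. 5 items with indices 0 to 4
print(nums[:6])        # [0, 1, 2, 3, 4, 5], i.e. 6 items (the off-by-one version)
print(list(range(5)))  # [0, 1, 2, 3, 4], the same 5 numbers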