7

For a text classification project (age), I'm making a subset of my data. I've made 3 lists of filenames, sorted by age. I want to shuffle these lists and then append 5000 filenames from each shuffled list to a new list. The result should be a data subset with 15000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I've written so far. But I know that random.shuffle returns None, and a NoneType object is not iterable. How can I solve this problem?

def seed():
   return 0.47231099848

teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
data = []
for categorie in random.shuffle([teens, tweens, thirthies],seed):
    data.append(teens[:5000])
    data.append(tweens[:5000])
    data.append(thirthies[:5000])
mforpe
Bambi

4 Answers

9

The first problem is that you are shuffling the outer list of 3 items, [teens, tweens, thirthies] (even though each item is itself a list), instead of shuffling each sublist.

Second, you may use random.sample instead of random.shuffle

for categ in [teens, tweens, thirthies]:
    data.append(random.sample(categ, 5000))

or, as @JonClements suggested in the comments, you can use a list comprehension

categories = [teens, tweens, thirthies]
data = [e for categ in categories for e in random.sample(categ, 5000)]
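
One caveat worth checking: random.sample raises ValueError when asked for more items than the list holds, so a category with fewer than 5000 files would fail. Below is a minimal sketch of one way to guard against that. The helper name sample_up_to and the file names are made up for illustration and are not part of the original answer.

```python
import random

def sample_up_to(items, k):
    """Return k random items, or all of them in random order if there are fewer than k."""
    return random.sample(items, min(k, len(items)))

# Made-up file names standing in for the real lists
teens = ["teen_%d.xml" % i for i in range(20)]
tweens = ["tween_%d.xml" % i for i in range(3)]   # deliberately too short

data = [e for categ in [teens, tweens] for e in sample_up_to(categ, 5)]
print(len(data))  # 8: five from teens plus all three from tweens
```

Whether silently taking fewer files is acceptable depends on how balanced the subset needs to be; raising an error may be the better choice for a classification task.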
Luchko
  • @JonClements the list comprehension gave me exactly what I wanted: a list of 15000 randomly chosen files, with 5000 per category. This really helped me – Bambi Apr 23 '17 at 12:11
  • @JonClements every time you run the list comprehension, the samples change, I guess. My next step is to make a list with info I take from the files in the subset list. But when I run a for-loop on the subset list, I now get a name error: name 'data' is not defined – Bambi Apr 23 '17 at 12:23
  • 2
    To get the same samples each time, simply add `random.seed(0.47231099848)` after your `import random`, as explained by @PM 2Ring – Luchko Apr 23 '17 at 12:30
7

You're correct that random.shuffle returns None. That's because it shuffles its list argument in-place, and it's a Python convention that functions which take a mutable arg and mutate it return None. However, you misunderstand the random arg to random.shuffle: it needs to be a random number generator, not a function like your seed that always returns the same number.
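
The in-place behaviour can be seen in a small sketch like this one (the file names are made up):

```python
import random

names = ["a.xml", "b.xml", "c.xml", "d.xml"]
result = random.shuffle(names)   # shuffles `names` in place

print(result)         # None
print(sorted(names))  # the same four items; only their order may have changed
```

So the thing to iterate over afterwards is the list you passed in, not the return value.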

BTW, you can seed the standard random number generator provided by the random module using its seed function. random.seed accepts any hashable object as its argument, although it's customary to pass it a number or string. You can also pass it None (which is equivalent here to not passing it an arg at all), and it will seed the randomiser with the system random source (if there isn't a system random source then the system time is used as the seed). If you don't explicitly call seed after importing the random module, that's equivalent to calling seed().

The benefit of supplying a seed is that each time you run the program with the same seed, the random numbers produced by the various random module functions will be exactly the same. This is very useful while developing and debugging your code: it can be hard to track down errors when the output keeps on changing. :)
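
As a rough illustration of that reproducibility, here is a sketch with made-up file names and an arbitrary integer seed:

```python
import random

files = ["file_%d.xml" % i for i in range(100)]  # made-up names

random.seed(12345)
first = random.sample(files, 5)

random.seed(12345)   # reseeding restarts the same random sequence
second = random.sample(files, 5)

print(first == second)  # True
```

The two samples match only because the same calls are made in the same order after each seed call; adding another random call in between would change the result.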


There are two main ways to do what you want. You can shuffle the lists and then slice the first 5000 file names from them. Or you can use the random.sample function to take 5000 random samples. That way you don't need to shuffle the whole list.

import random

random.seed(0.47231099848)

# teens, tweens, thirties are lists of file names
file_lists = [teens, tweens, thirties]

# Shuffle
data = []
for flist in file_lists:
    random.shuffle(flist)
    data.append(flist[:5000])

Using sample

# Sample
data = []
for flist in file_lists:
    data.append(random.sample(flist, 5000))

I haven't performed speed tests on this code, but I suspect that sample will be faster, since it just needs to randomly select items rather than moving all the list items. shuffle is fairly efficient, so you probably wouldn't notice much difference in the run time unless your teens, tweens, and thirties file lists each have a lot more than 5000 file names.
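
If you want to check this on your own machine, a timing sketch along these lines could be a starting point. The list size and repeat count are arbitrary assumptions, and no particular result is claimed here.

```python
import random
import timeit

population = list(range(100000))  # stand-in for one large list of file names

def via_shuffle():
    copy = population[:]      # copy so every run starts from the same list
    random.shuffle(copy)
    return copy[:5000]

def via_sample():
    return random.sample(population, 5000)

shuffle_time = timeit.timeit(via_shuffle, number=5)
sample_time = timeit.timeit(via_sample, number=5)
print("shuffle + slice: %.4f s" % shuffle_time)
print("sample:          %.4f s" % sample_time)
```

Note that the copy inside via_shuffle adds some overhead of its own, so this slightly overstates the cost of shuffling a list you are happy to modify.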

Both of those loops make data a nested list containing 3 sublists, with 5000 file names in each sublist. However, if you want it to be a flat list of 15000 file names, you just need to use the list.extend method instead of list.append. E.g.,

data = []
for flist in file_lists:
    data.extend(random.sample(flist, 5000))

Or we can do it using a list comprehension with a double for loop:

data = [fname for flist in file_lists for fname in random.sample(flist, 5000)]
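
To make the difference between append and extend concrete, here is a scaled-down sketch with made-up file names and 4 samples per list instead of 5000:

```python
import random

# Made-up lists standing in for teens, tweens and thirties
file_lists = [["%s_%d.xml" % (age, i) for i in range(10)]
              for age in ("10s", "20s", "30s")]

nested, flat = [], []
for flist in file_lists:
    picked = random.sample(flist, 4)
    nested.append(picked)   # adds one sublist
    flat.extend(picked)     # adds the four names themselves

print(len(nested))  # 3
print(len(flat))    # 12
```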

If you need to filter the contents of data to build your final file list, the simplest way is to add an if condition to the list comprehension.

Let's say we have a function that can test whether a file name is one we want to keep:

def keep_file(fname):
    # Return True if we want to keep fname, otherwise return False
    return True  # placeholder: replace with your real test

Then we can do

data = [fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname)]

and data will only contain the file names that pass the keep_file test.

Another way to do it is to create the file names using a generator expression instead of a list comprehension and then pass that to the built-in filter function:

data_gen = filter(keep_file, (fname for flist in file_lists for fname in random.sample(flist, 5000)))

data_gen is itself an iterator. You can build a list from it like this:

data_final = list(data_gen)

Or if you don't actually need all the names as a collection and you can just process them one by one, you can put it in a for loop, like this:

for fname in data_gen:
    print(fname)
    # Do other stuff with fname

This uses less RAM, but the downside is that it "consumes" the file names, so once the for loop is finished data_gen will be empty.
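
The "consumed" behaviour can be sketched as follows, assuming Python 3 (where filter returns an iterator). The keep_file rule and the file names here are invented purely for the demonstration.

```python
import random

def keep_file(fname):
    # Hypothetical rule: keep only names whose number is even
    return int(fname.split("_")[1].split(".")[0]) % 2 == 0

# Made-up file names: f_0.xml to f_29.xml, split into three lists
file_lists = [["f_%d.xml" % i for i in range(start, start + 10)]
              for start in (0, 10, 20)]

data_gen = filter(keep_file, (fname for flist in file_lists
                              for fname in random.sample(flist, 4)))

first_pass = list(data_gen)    # consumes the iterator
second_pass = list(data_gen)   # nothing is left to iterate over

print(second_pass)  # []
```

If you need to go over the names more than once, build the list once and loop over that instead.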

Let's assume that you've written a function that extracts the desired data from each file:

def age_and_text(fname):
    # Do stuff that extracts the age and desired text from the file
    age, text = None, None  # placeholders: replace with the real extraction
    return fname, age, text

You could create a list of those (filename, age, text) tuples like this:

data_gen = (fname for flist in file_lists for fname in random.sample(flist, 5000) if keep_file(fname))

final_data = [age_and_text(fname) for fname in data_gen]

Notice the slice in my first snippet: flist[:5000]. That takes the first 5000 items in flist, the items with indices 0 to 4999 inclusive. Your version had teens[:5001], which is an off-by-one error. Slices work the same way as ranges. Thus range(5000) yields the 5000 numbers from 0 to 4999. It works this way because Python (like most modern programming languages) uses zero-based indexing.

PM 2Ring
  • Is it correct that when I execute that code, the result is a nested list consisting of a list per categorie? Because `print(len(data))` gives 3 as a result. – Bambi Apr 23 '17 at 12:04
  • @Lorien Yes, it creates a nested list containing 3 sublists, with 5000 filenames in each sublist. I assumed you wanted that because of the code you posted in your question. But it's very easy to change it so it produces a flat list of 15000 file names. I'll add some more code to my answer. – PM 2Ring Apr 23 '17 at 12:22
  • Yes, a flat list would be better I guess, because I have to loop over it again to make a definite data list which consists of the info I take out of the files. – Bambi Apr 23 '17 at 12:31
  • thank you, but my files are xml files, so the final list should be a list that takes the age and text out of the xml tree per file and puts it in a list. That's the next step for me to figure out – Bambi Apr 23 '17 at 13:02
  • @Lorien You can still use my code, you just need to put the correct logic in the `keep_file` function. If you need help writing the XML processing code you should ask a fresh question. Your new question can contain a link to this one (and vice versa) to let people know that it's a follow-on question. – PM 2Ring Apr 23 '17 at 13:10
  • 1
    @Lorien I guess you want the final list to be a list of (filename, age, text) tuples. Once you've extracted the age and text data, that's pretty easy to do, as my latest update shows. – PM 2Ring Apr 23 '17 at 13:26
  • 1
    Careful with seeding the RNG with arbitrary objects and expecting the same results: things change when you change Python versions or CPU architectures (and some hashes -- e.g. of strings -- are randomized in Python 3). – Marius Gedminas May 02 '17 at 10:29
  • @MariusGedminas Good points. Notice that I said that you _can_ use any hashable object as the seed, but I didn't say it was a good idea. ;) FWIW, I'm not happy that you can't run the same code on Python 2 & 3 that uses a fixed string or integer seed and get the same results, as I mentioned [here](http://stackoverflow.com/a/41955452/4014959). – PM 2Ring May 02 '17 at 10:41
5

random.shuffle returns None, which is not iterable.

You should do:

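data = []
for category in [teens, tweens, thirthies]:
    category_copy = category[:]  # copy so the original list keeps its order
    # The optional second argument of random.shuffle was removed in Python 3.11;
    # call random.seed(...) once beforehand if you need reproducible results.
    random.shuffle(category_copy)
    data.append(category_copy[:5000])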
Azat Ibrakov
  • 1
    this will append `5001` elements from each list! Use `category_copy[:5000]` instead – Ivan Borshchov Apr 23 '17 at 11:47
  • 1
    There's no need to copy the lists, since the OP doesn't mind if the original lists are shuffled. OTOH, it's probably more efficient to use `sample` rather than `shuffle`. – PM 2Ring Apr 23 '17 at 13:13
1

random.shuffle changes the list itself (it shuffles it in place). So it looks like you want something like this:

teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
random.shuffle(teens)
random.shuffle(tweens)
random.shuffle(thirthies)
data = []
for categorie in [teens, tweens, thirthies]:
    data.append(categorie[:5000])

BTW, somelist[:n] gives you the first n elements; check this:

>>> [1,2,3,4,5][:3]
[1, 2, 3]
Make Tips