Random List of millions of elements in Python Efficiently

Question

I have read this answer, which is potentially the best way to randomize a list of strings in Python. I'm wondering whether it is also the most efficient way, because I have a list of about 30 million elements built by the following code:

import json
from sets import Set
from random import shuffle

a = []

for i in range(0,193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    for j in range(0,len(data)):
        a.append(data[j]['su'])
new = list(Set(a))
print "Cleaned length is: " + str(len(new))

## Take Cleaned List and Randomize it for Analysis
shuffle(new)

If there is a more efficient way to do it, I'd greatly appreciate any advice on how to do it.

Thanks,

Hugh Bothwell · Answer 1 · 2011-01-08T12:13:17.927

A couple of possible suggestions:

import json
from random import shuffle

a = set()
for i in range(193):
    with open("C:/Twitter/user/user_{0}.json".format(i)) as json_data:
        data = json.load(json_data)
        a.update(d['su'] for d in data)

print("Cleaned length is {0}".format(len(a)))

# Take Cleaned List and Randomize it for Analysis
new = list(a)
shuffle(new)

The only way to know if this is faster is to profile it!
Do you prefer sets.Set to the built-in set() for a reason?
I have introduced a with clause (the preferred way of opening files, as it guarantees they get closed).
It did not appear that you were doing anything with 'a' as a list except converting it to a set; why not make it a set from the start?
Rather than iterate on an index and then do a lookup on that index, I just iterate on the data items...
...which makes it easily rewritable as a generator expression.
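
As a rough way to act on the "profile it" advice, the two ways of building the de-duplicated set can be timed with the standard timeit module. This is only a sketch: `records` is a made-up stand-in for one parsed JSON file, the function names are hypothetical, and the timings on real 30-million-element data may look quite different.

```python
import timeit

# Hypothetical stand-in for one parsed JSON file: a list of dicts with an 'su' key.
records = [{"su": "user_{0}".format(n % 5000)} for n in range(20000)]

def dedupe_via_list(data):
    """Question's approach: index into the list, append, then convert to a set."""
    a = []
    for j in range(len(data)):
        a.append(data[j]["su"])
    return set(a)

def dedupe_via_set(data):
    """Answer's approach: feed a generator expression straight into a set."""
    a = set()
    a.update(d["su"] for d in data)
    return a

# Time each variant on the same input; only the relative numbers are meaningful.
for fn in (dedupe_via_list, dedupe_via_set):
    seconds = timeit.timeit(lambda: fn(records), number=20)
    print("{0}: {1:.4f} s for 20 runs".format(fn.__name__, seconds))
```

Both functions return the same set, so any difference in the printed times comes from how the set is built rather than from what it contains.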

Thanks for the advice, how will itervalues work since I thought data is a list? And not a dict - which also seems to be the problem when I run it: `AttributeError: 'list' object has no attribute 'itervalues'` — eWizardII, Jan 08 '11 at 04:33
He meant, 'a.update(d['su'] for d in data)' the method `.itervalues` is for dictionaries. Basically there is no reason for you to use `range` here anyways. — milkypostman, Jan 08 '11 at 04:52

score 2 · Answer 2 · edited May 23 '17 at 12:10

If you're going to shuffle, you're probably better off using the solution from this file. For realz.

randomly mix lines of 3 million-line file

Basically, the random number generator behind shuffle has a limited period, meaning it can't reach all the possible orderings of 3 million lines, let alone 30 million. If you can load the data in memory, then your best bet is what they suggest there: assign a random number to each line and sort that badboy.
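
The period argument can be checked with a little arithmetic. The sketch below assumes the generator is CPython's default Mersenne Twister, whose documented period is 2**19937 - 1, and compares that with log2(n!), the number of bits needed to index every ordering of n items. The helper name `permutation_bits` is made up for illustration.

```python
import math

MT_PERIOD_BITS = 19937  # Mersenne Twister period is 2**19937 - 1

def permutation_bits(n):
    """Return log2(n!), the bits needed to index every ordering of n items."""
    # lgamma(n + 1) is ln(n!), computed without building the huge integer.
    return math.lgamma(n + 1) / math.log(2)

for n in (2080, 2081, 3000000, 30000000):
    bits = permutation_bits(n)
    fits = bits <= MT_PERIOD_BITS
    print("n={0}: about {1:.0f} bits, fits within period: {2}".format(n, bits, fits))
```

This is a counting argument only: it shows how many orderings exist compared with the length of the generator's cycle, and says nothing about how evenly the reachable orderings are distributed.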

See this thread. And here, I did it for you so you didn't mess anything up (that's a joke):

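import json
import random
from operator import itemgetter

a = set()
for i in range(193):
    # "with" closes each file as soon as it has been parsed
    with open("C:/Twitter/user/user_" + str(i) + ".json") as json_data:
        data = json.load(json_data)
    a.update(d['su'] for d in data)

# Report the size of the de-duplicated set
print("Cleaned length is: " + str(len(a)))

# Pair every element with a random key, sort on the key, then drop the keys
new = [(random.random(), el) for el in a]
new.sort()
new = list(map(itemgetter(1), new))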

I'd rather use the "key" parameter of the list.sort method instead of building that "new", list... something like: new = sorted(a, key=lambda x: random.random()) — Bakuriu, Jan 08 '11 at 16:38
Two problems with your idea, 1) lambda is really slow, 2) the sort method is going to do exactly what I've suggested in the end. Basically when you supply `key` to the sort algorithm, all it does is create `(key, object)` tuples. — milkypostman, Jan 08 '11 at 17:58
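
The disagreement in these comments can be settled by measurement rather than argument. A minimal sketch, assuming a made-up list `items` and hypothetical function names, that times the `key` version, the explicit tuple version, and plain random.shuffle on the same data:

```python
import random
import timeit
from operator import itemgetter

# Hypothetical sample data; the real list has about 30 million strings.
items = ["user_{0}".format(n) for n in range(50000)]

def shuffle_by_key(seq):
    # Bakuriu's suggestion: sorted() calls the key function once per element.
    return sorted(seq, key=lambda x: random.random())

def shuffle_by_decorate(seq):
    # The answer's approach: build (random key, element) pairs, sort, strip keys.
    pairs = [(random.random(), el) for el in seq]
    pairs.sort()
    return list(map(itemgetter(1), pairs))

def shuffle_in_place(seq):
    # Baseline: copy the list, then shuffle the copy with random.shuffle.
    out = list(seq)
    random.shuffle(out)
    return out

for fn in (shuffle_by_key, shuffle_by_decorate, shuffle_in_place):
    seconds = timeit.timeit(lambda: fn(items), number=3)
    print("{0}: {1:.3f} s for 3 runs".format(fn.__name__, seconds))
```

All three return a reordering of the same elements; which one is fastest depends on the interpreter version and the data, so the numbers are worth re-running on the real list.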

score 0 · Answer 3 · answered Jan 08 '11 at 02:42

I don't know if it will be any faster but you could try numpy's shuffle.

Fragsworth
