3

I read this answer, which suggests what may be the best way to randomize a list of strings in Python. I'm wondering whether it is also the most efficient way, because my list has about 30 million elements and is built by the following code:

import json
from sets import Set
from random import shuffle

a = []

for i in range(0,193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    for j in range(0,len(data)):
        a.append(data[j]['su'])
new = list(Set(a))
print "Cleaned length is: " + str(len(new))

## Take Cleaned List and Randomize it for Analysis
shuffle(new)

If there is a more efficient way to do it, I'd greatly appreciate any advice on how to do it.

Thanks,

eWizardII

3 Answers

4

A couple of possible suggestions:

import json
from random import shuffle

a = set()
for i in range(193):
    with open("C:/Twitter/user/user_{0}.json".format(i)) as json_data:
        data = json.load(json_data)
        a.update(d['su'] for d in data)

print("Cleaned length is {0}".format(len(a)))

# Take Cleaned List and Randomize it for Analysis
new = list(a)
shuffle(new)


  • the only way to know if this is faster is to profile it!
  • do you prefer sets.Set to the built-in set() for a reason?
  • I have introduced a with statement (the preferred way of opening files, as it guarantees they get closed)
  • it did not appear that you were doing anything with 'a' as a list except converting it to a set; why not make it a set from the start?
  • rather than iterate on an index, then do a lookup on the index, I just iterate on the data items...
  • which makes it easily rewriteable as a generator expression
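To act on the "profile it" point, here is a minimal timing sketch. It does not read the real JSON files: `files` is a small synthetic stand-in (its sizes are arbitrary assumptions), and the function names `list_then_set` and `set_from_start` are made up for illustration. It only compares the two ways of collecting the 'su' values.

```python
import random
import timeit

# Synthetic stand-in for the parsed JSON files: a list of "files", each a
# list of records with an 'su' field. The sizes are arbitrary assumptions.
random.seed(0)
files = [[{"su": random.randrange(50000)} for _ in range(2000)]
         for _ in range(20)]

def list_then_set(files):
    """Collect every value in a list, then deduplicate (the question's approach)."""
    a = []
    for data in files:
        for j in range(len(data)):
            a.append(data[j]["su"])
    return list(set(a))

def set_from_start(files):
    """Deduplicate while reading, using a generator expression."""
    a = set()
    for data in files:
        a.update(d["su"] for d in data)
    return list(a)

# Time each variant a few times; absolute numbers depend on the machine.
for func in (list_then_set, set_from_start):
    seconds = timeit.timeit(lambda: func(files), number=5)
    print("%s: %.4f s for 5 runs" % (func.__name__, seconds))
```

On the real data, most of the time may well go to reading and parsing the files, so it is worth timing that part separately before tuning the rest.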
Hugh Bothwell
  • Thanks for the advice, how will itervalues work since I thought data is a list? And not a dict - which also seems to be the problem when I run it: `AttributeError: 'list' object has no attribute 'itervalues'` – eWizardII Jan 08 '11 at 04:33
  • He meant `a.update(d['su'] for d in data)`; the `.itervalues` method is for dictionaries. Basically there is no reason for you to use `range` here anyway. – milkypostman Jan 08 '11 at 04:52
2

If you are going to shuffle a list this large, you're probably better off using the solution from the question linked below. For real.

randomly mix lines of 3 million-line file

Basically, the random number generator behind shuffle has a finite period, which is far smaller than the number of possible orderings of 3 million lines, let alone 30 million, so shuffle can't reach every possible ordering. If you can load the data in memory, then your best bet is to do as they say: assign a random number to each line and sort on that number.
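To put rough numbers on the period argument, here is a small sketch. It relies on one fact about CPython, namely that the random module uses the Mersenne Twister with period 2**19937 - 1; the helper name `permutation_bits` is made up for illustration.

```python
import math

# Exponent of the Mersenne Twister period (2**19937 - 1) used by CPython's
# random module.
MT_PERIOD_BITS = 19937

def permutation_bits(n):
    """Return log2(n!), the bits needed to index every ordering of n items."""
    return math.lgamma(n + 1) / math.log(2)

for n in (2080, 2081, 3000000, 30000000):
    bits = permutation_bits(n)
    verdict = "fits within" if bits <= MT_PERIOD_BITS else "exceeds"
    print("n = %d: about %.0f bits, %s the generator's period"
          % (n, bits, verdict))
```

Keys drawn with random.random() come from that same generator, so this arithmetic describes the generator rather than shuffle specifically.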

See this thread. And here, I did it for you so you don't mess anything up (that's a joke):

import json
import random
from operator import itemgetter

a = set()
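for i in range(193):
    # the with statement closes each file once it has been parsed
    with open("C:/Twitter/user/user_" + str(i) + ".json") as json_data:
        data = json.load(json_data)
    a.update(d['su'] for d in data)

# report the number of unique elements collected
print("Cleaned length is: " + str(len(a)))

# pair each element with a random key, sort on the key, then drop the key
new = [(random.random(), el) for el in a]
new.sort()
new = list(map(itemgetter(1), new))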
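For a quick sanity check of the decorate, sort, strip pattern, here is a small self-contained sketch on made-up data (the `items` list is hypothetical). It also shows the equivalent form using the `key` argument of `sorted`, which computes each key once per element.

```python
import random

# Hypothetical stand-in for the set of unique 'su' values.
items = ["user_%d" % i for i in range(10)]

# Decorate each element with a random key, sort the pairs, strip the keys.
decorated = [(random.random(), el) for el in items]
decorated.sort()
shuffled_dsu = [el for _, el in decorated]

# Same idea with the key parameter; no list of pairs is built explicitly.
shuffled_key = sorted(items, key=lambda el: random.random())

print(shuffled_dsu)
print(shuffled_key)
```

One small difference: if two random keys happen to be equal, sorting the pairs falls back to comparing the elements themselves, whereas the `key` form keeps such elements in their original relative order.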
milkypostman
  • I'd rather use the "key" parameter of the list.sort method instead of building that "new", list... something like: new = sorted(a, key=lambda x: random.random()) – Bakuriu Jan 08 '11 at 16:38
  • Two problems with your idea, 1) lambda is really slow, 2) the sort method is going to do exactly what I've suggested in the end. Basically when you supply `key` to the sort algorithm, all it does is create `(key, object)` tuples. – milkypostman Jan 08 '11 at 17:58
0

I don't know if it will be any faster but you could try numpy's shuffle.
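A minimal sketch of what that could look like, assuming numpy is installed and the strings fit in a numpy array; the `names` list is a made-up stand-in for the real data. Whether this is faster than random.shuffle would need to be measured.

```python
import numpy as np

# Hypothetical stand-in for the deduplicated list of 'su' values.
names = ["user_%d" % i for i in range(1000)]

arr = np.array(names)            # fixed-width unicode array
rng = np.random.default_rng(42)  # seed chosen only to make the run repeatable
rng.shuffle(arr)                 # shuffles in place along the first axis

shuffled = arr.tolist()
print(shuffled[:5])
```

Note that converting 30 million Python strings to an array stores every entry at the width of the longest string, so memory use is worth checking first.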

Fragsworth