0

I have a list of words. It's pretty large (len(list) ~ 70,000). I'm currently using this code:

replacement = "bla"
for word in data:
    if (word in unique_words):
        word = replacement

This code takes a while to run. Is there a quicker way to do this?

Archer2486
  • 2
    How you optimize this depends on your needs. Does data need to preserve its order? If no, then use a set or a Counter. Do you need all the results at one time? If no, then use a generator. Can data be broken up into smaller groups? If so, then use a dict of dicts or some other higher-order structure, so you only work with part of the dataset at one time. – Jeffery Thomas May 26 '12 at 13:23

2 Answers

6

Use a set for unique_words. Sets are considerably faster than lists for membership tests: a set does a hash lookup (O(1) on average), while a list has to be scanned element by element (see Python Sets vs Lists).
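A minimal sketch of that change, assuming data and unique_words are plain lists of strings as in the question. The sample values are made up for illustration, and unique_set is a hypothetical name:

```python
replacement = "bla"

# Hypothetical sample data; the real lists hold around 70,000 words.
data = ["the", "quick", "brown", "fox", "the", "lazy", "dog"]
unique_words = ["quick", "lazy"]

# Build the set once, outside the loop. Each `in` test is then an
# average O(1) hash lookup instead of a scan over the whole list.
unique_set = set(unique_words)

# Assign by index so the list itself is updated.
for i, word in enumerate(data):
    if word in unique_set:
        data[i] = replacement

print(data)  # ['the', 'bla', 'brown', 'fox', 'the', 'bla', 'dog']
```

The conversion to a set costs one pass over unique_words, so it pays off as soon as more than a handful of lookups follow.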

Also, it's only a stylistic issue but I think you should drop the brackets in the if. It looks cleaner.

jamylak
  • Could you explain why the brackets would make the performance slower? I don't see why that would be. – acattle May 26 '12 at 13:14
  • 1
    They won't, but they look ugly. – jamylak May 26 '12 at 13:14
  • Oh man... Can't believe I haven't thought of that. I'm pretty new to Python so I wasn't aware of the no-brackets-in-an-If thing. I just saw some code once (of someone who obviously made the same mistake) and figured that's the way to go. Anyway, thanks for the help! – Archer2486 May 26 '12 at 13:15
4

The code you have posted doesn't actually do any replacement. Here is a snippet that does:

for key, word in enumerate(data):
    if word in unique_words:
        data[key] = replacement

Here's a more compact way:

new_list = [replacement if word in unique_words else word for word in data]

I think unique_words is an odd name for the variable considering its use; perhaps it should be search_list?

Edit:

After your comment, perhaps this is better:

from collections import Counter

# Count how often each word occurs
c = Counter(data)
# Keep the words that occur exactly once in a set, so lookups stay fast
only_once = {k for k, v in c.items() if v == 1}

# Now replace all occurrences of these words with something else
for k, v in enumerate(data):
    if v in only_once:
        data[k] = replacement
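
The asker's comments describe counting words in a training set and replacing them in a separate test set. Here is a rough sketch of that variant, assuming two plain lists of strings. The names train_words, test_words and replaced, and the sample values, are hypothetical:

```python
from collections import Counter

replacement = "bla"

# Hypothetical sample data standing in for the real training and test sets.
train_words = ["a", "b", "a", "c", "d", "d"]
test_words = ["a", "b", "c", "e", "d"]

# Count occurrences in the training set only.
counts = Counter(train_words)

# Words seen exactly once in training, kept in a set for fast lookups.
only_once = {word for word, n in counts.items() if n == 1}

# Build a new list instead of modifying test_words in place.
replaced = [replacement if word in only_once else word for word in test_words]

print(replaced)  # ['a', 'bla', 'bla', 'e', 'd']
```

Note that a word such as "e", which never appears in the training set, is left alone here. Whether it should also be replaced depends on the use case.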
Burhan Khalid
  • My code is supposed to replace all the words that appear only once in a training set of words with a constant *replacement* in the test set of words. For that reason the *unique_word* name makes sense. Thanks for the additional suggestions! – Archer2486 May 26 '12 at 13:41
  • That part you added in your edit is pretty nice. I'm gonna use it. Thanks! – Archer2486 May 26 '12 at 17:52