I'm having a problem building a vocabulary of words in Python. My code goes through every word in a document of about 2.3 MB and checks whether the word is already in the vocabulary list; if it is not, it appends the word to the list.

The problem is that it is taking way too long (I haven't even gotten it to finish yet). How can I solve this?

Code:

words = [("_", "hello"), ("hello", "world"), ("world", "."), (".", "_")] # List of a ton of tuples of words
vocab = []
for w in words:
    if w not in vocab:
        vocab.append(w)
N. Chalifour

2 Answers

Unless you need vocab to have a particular order, you can just do:

vocab = set(words)
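
If the order does matter, one option is to keep a set purely for the membership test while still appending to a list. The following is a minimal sketch, not the only way to do it; the helper name `unique_in_order` is made up for illustration, and the `dict.fromkeys` variant assumes Python 3.7 or later, where dicts preserve insertion order.

```python
def unique_in_order(items):
    """Return the distinct items in the order they were first seen."""
    seen = set()      # fast membership checks (hash lookup)
    result = []       # preserves first-seen order
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Same shape of data as in the question, with one duplicate added
words = [("_", "hello"), ("hello", "world"), ("world", "."), (".", "_"), ("_", "hello")]

vocab = unique_in_order(words)

# Alternative (assumes Python 3.7+): dict keys are unique and keep insertion order
vocab_alt = list(dict.fromkeys(words))
```

Both variants need the items to be hashable, which is the case for tuples of strings.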
Alex Hall

The following test compares the execution time of the for loop with that of set():

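import random
import string
import time

# 1000 random 5-letter words, repeated 10 times to create duplicates
words = [''.join(random.sample(string.ascii_letters, 5)) for i in range(1000)] * 10

# Approach 1: list with a membership test on every word
vocab1 = []

t1 = time.time()
for w in words:
    if w not in vocab1:
        vocab1.append(w)
t2 = time.time()

# Approach 2: build a set directly
t3 = time.time()
vocab2 = set(words)
t4 = time.time()

print(t2 - t1)  # for loop, in seconds
print(t4 - t3)  # set(), in seconds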

Output:

0.0880000591278  # Using for loop
0.000999927520752  # Using set()
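
The gap comes from the membership test: `w not in vocab1` on a list scans the list element by element, so the loop gets slower as the vocabulary grows, while a set uses hashing and looks an item up in roughly constant time on average. A single `time.time()` measurement can be noisy, so here is a sketch of the same comparison using the standard `timeit` module. The function names `dedupe_with_list` and `dedupe_with_set` are made up for this example, and the absolute numbers will vary by machine.

```python
import random
import string
import timeit

random.seed(0)  # fixed seed so the word list is reproducible

# 1000 random 5-letter words, repeated 10 times to create duplicates
words = ["".join(random.sample(string.ascii_letters, 5)) for _ in range(1000)] * 10


def dedupe_with_list(items):
    """List-based version: each 'not in' check scans the whole list."""
    vocab = []
    for w in items:
        if w not in vocab:
            vocab.append(w)
    return vocab


def dedupe_with_set(items):
    """Set-based version: hashing makes each lookup cheap on average."""
    return set(items)


# Run each approach 5 times and report the total time in seconds
list_time = timeit.timeit(lambda: dedupe_with_list(words), number=5)
set_time = timeit.timeit(lambda: dedupe_with_set(words), number=5)

print("list loop:", list_time)
print("set():    ", set_time)
```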
ettanany