
I'm currently running this code:

i = 0
for dicword in dictionary:
    for line in train:
        for word in line:
            if dicword == word:
                pWord[i] = pWord[i] + 1
    i = i + 1

where dictionary and pWord are 1D lists of the same size, and train is a 2D list.

Both dictionary and train are very large, and the code executes slowly.

How can I optimize this particular piece of code and code like this in general?

Edit: train is a list of about 2000 lists, each of which contains the individual words pulled from a document. dictionary was created by pulling each unique word from all of train.

Here is the creation of dictionary:

dictionary = []
for line in train:
    for word in line:
        if word not in dictionary:
            dictionary.append(word)

Edit 2: Sample of the content in each list:

[ ... , 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', ...]
  • This really depends on your actual data; can you expand on what the actual data is? –  Feb 17 '16 at 04:20
  • Is every `word` in the `line` of the list `train` a potential `dicword` in `dictionary`? We need more information about the lists you are going through. – aug Feb 17 '16 at 04:21
  • Added the edits to make this more clear, though it sounds like you're right about the format, aug. When I refer to the words, I mean just a string with no spaces. – NGXII Feb 17 '16 at 04:24
  • Where is your list sample? – MLSC Feb 17 '16 at 04:26
  • An example with three words from your dictionary would greatly improve the quality of the responses. – Alexander Feb 17 '16 at 04:27
  • Step 1: Make `dictionary` a set instead of a list. Step 2: Eliminate the outermost loop in the first code sample and just use `if word in dictionary`. – Kevin Feb 17 '16 at 04:28
  • Updated as per your recommendations, MLSC and Alexander. – NGXII Feb 17 '16 at 04:32
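Kevin's two steps from the comments can be sketched like this (the sample data and variable names are illustrative, not the asker's actual data):

```python
# Build `dictionary` as a set, so membership tests are O(1) instead of
# O(n) as they are with a list. Then count every word in a single pass
# over `train`, rather than one full pass per dictionary word.
train = [["big", "long", "list", "of", "big", "words"],
         ["small", "short", "list", "of", "short", "words"]]

dictionary = set()
for line in train:
    for word in line:
        dictionary.add(word)

counts = {word: 0 for word in dictionary}
for line in train:
    for word in line:
        if word in dictionary:  # O(1) with a set
            counts[word] += 1

print(counts["big"], counts["long"])  # 2 1
```

This turns the original O(len(dictionary) * total_words) triple loop into a single O(total_words) pass.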

2 Answers


You can use collections.Counter, which does all of the counting in a single pass.

from collections import Counter

train = [["big", "long", "list", "of", "big", "words"], 
         ["small", "short", "list", "of", "short", "words"]]

c = Counter(word for line in train for word in line)

>>> c
Counter({'big': 2,
         'list': 2,
         'long': 1,
         'of': 2,
         'short': 2,
         'small': 1,
         'words': 2})

Note that the counter itself is constructed using a generator expression (aka generator comprehension).

Also note that you don't even need to create a dictionary. It is created for you via Counter.
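If you do still want the list of unique words, it falls out of the counter for free, since iterating a Counter yields its keys (a small sketch using the same sample data):

```python
from collections import Counter

train = [["big", "long", "list", "of", "big", "words"],
         ["small", "short", "list", "of", "short", "words"]]

c = Counter(word for line in train for word in line)

# Iterating the counter yields each unique word exactly once.
dictionary = list(c)
print(sorted(dictionary))
```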

You can then use a dictionary comprehension to get the most common words, e.g. top 5:

>>> {word: count for word, count in c.most_common(5)}
{'big': 2, 'list': 2, 'of': 2, 'short': 2, 'words': 2}
Alexander

"How can I optimize this particular piece of code and code like this in general?"

A good general strategy for processing lists with very many elements is to use generators (see also the Python docs on generators). If you're streaming through a large list, transforming or aggregating its elements, you may not need them all in memory at the same time.
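As a small sketch (the data and names are illustrative): a generator function yields one word at a time, and an aggregating consumer like sum() drains it lazily, so no intermediate list of all words is ever built.

```python
def read_words(lines):
    """Yield words one at a time instead of building an intermediate list."""
    for line in lines:
        for word in line.split():
            yield word

lines = ["big long list of big words", "small short list of short words"]

# sum() consumes the generator lazily; only one word is alive at a time.
total_letters = sum(len(word) for word in read_words(lines))
print(total_letters)  # 47
```

In real use, `lines` could itself be a lazy source such as an open file handle, so neither the lines nor the words need to fit in memory at once.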

Brian Cain