
I'm currently running this code:

i = 0
for dicword in dictionary:
    for line in train:
        for word in line:
            if dicword == word:
                pWord[i] = pWord[i] + 1
    i = i + 1

where dictionary and pWord are 1D lists of the same size, and train is a 2D list.

Both dictionary and train are very large, and the code executes slowly.

How can I optimize this particular piece of code and code like this in general?

Edit: train is a list of about 2000 lists, each of which contains the individual words pulled from a document. dictionary was created by pulling each unique word from all of train.

Here is the creation of dictionary:

dictionary = []
for line in train:
    for word in line:
        if word not in dictionary:
            dictionary.append(word)

Edit 2: Sample of the content in each list:

[ ... , 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', ...]
  • This really depends on your actual data; can you expand on what the actual data is? –  Feb 17 '16 at 04:20
  • Is every `word` in the `line` of the list `train` a potential `dicword` in `dictionary`? We need more information about the lists you are going through. – aug Feb 17 '16 at 04:21
  • Added the edits to make this more clear, though it sounds like you're right about the format, aug. When I refer to the words, I mean just a string with no spaces. – NGXII Feb 17 '16 at 04:24
  • Where is your list sample? – MLSC Feb 17 '16 at 04:26
  • An example with three words from your dictionary would greatly improve the quality of the responses. – Alexander Feb 17 '16 at 04:27
  • Step 1: Make `dictionary` a set instead of a list. Step 2: Eliminate the outermost loop in the first code sample and just use `if word in dictionary`. – Kevin Feb 17 '16 at 04:28
  • Updated as per your recommendations, MLSC and Alexander. – NGXII Feb 17 '16 at 04:32
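Kevin's two steps from the comments can be sketched like this (the sample data and variable names are illustrative, not the asker's actual data):

```python
# Build `dictionary` as a set, so membership tests are O(1) instead of
# O(n) as they are with a list. Then count every word in a single pass
# over `train`, rather than one full pass per dictionary word.
train = [["big", "long", "list", "of", "big", "words"],
         ["small", "short", "list", "of", "short", "words"]]

dictionary = set()
for line in train:
    for word in line:
        dictionary.add(word)

counts = {word: 0 for word in dictionary}
for line in train:
    for word in line:
        if word in dictionary:  # O(1) with a set
            counts[word] += 1

print(counts["big"], counts["long"])  # 2 1
```

This turns the original O(len(dictionary) * total_words) triple loop into a single O(total_words) pass.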

2 Answers


You can use collections.Counter, which does all of the counting in a single pass.

from collections import Counter

train = [["big", "long", "list", "of", "big", "words"], 
         ["small", "short", "list", "of", "short", "words"]]

c = Counter(word for line in train for word in line)

>>> c
Counter({'big': 2,
         'list': 2,
         'long': 1,
         'of': 2,
         'short': 2,
         'small': 1,
         'words': 2})

Note that the counter itself is constructed using a generator expression (aka generator comprehension).

Also note that you don't even need to create a dictionary. It is created for you via Counter.
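If you do still want the list of unique words, it falls out of the counter for free, since iterating a Counter yields its keys (a small sketch using the same sample data):

```python
from collections import Counter

train = [["big", "long", "list", "of", "big", "words"],
         ["small", "short", "list", "of", "short", "words"]]

c = Counter(word for line in train for word in line)

# Iterating the counter yields each unique word exactly once.
dictionary = list(c)
print(sorted(dictionary))
```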

You can then use a dictionary comprehension to get the most common words, e.g. top 5:

>>> {word: count for word, count in c.most_common(5)}
{'big': 2, 'list': 2, 'of': 2, 'short': 2, 'words': 2}
Alexander

"How can I optimize this particular piece of code and code like this in general?"

A good general strategy for processing lists with very many elements is to use generators (see also the Python docs on generators). If you're streaming through a large list, transforming or aggregating its elements, you may not need them all in memory at the same time.
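As a small sketch (the data and names are illustrative): a generator function yields one word at a time, and an aggregating consumer like sum() drains it lazily, so no intermediate list of all words is ever built.

```python
def read_words(lines):
    """Yield words one at a time instead of building an intermediate list."""
    for line in lines:
        for word in line.split():
            yield word

lines = ["big long list of big words", "small short list of short words"]

# sum() consumes the generator lazily; only one word is alive at a time.
total_letters = sum(len(word) for word in read_words(lines))
print(total_letters)  # 47
```

In real use, `lines` could itself be a lazy source such as an open file handle, so neither the lines nor the words need to fit in memory at once.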

Brian Cain