
I'm writing a program in Python 3, and part of its functionality is to find the word that occurs most often in a list and return the number of occurrences of that word. I have code that works, but part of the requirement is that it handle a list of 200,000+ words and complete this task in under a few seconds, and my code takes a really long time to run. What suggestions do you have for speeding up this method?

def max_word_frequency(words):
    """A method that takes a list and finds the word with the most
    occurrences and returns the number of occurences of that word
    as an integer.
    """
    max_count = 0
    for word in set(words):
        count = words.count(word)
        if count > max_count:
            max_count = count

    return max_count

I have contemplated using a dictionary, since dictionaries are hash-based and super speedy for lookups compared to lists, but I cannot quite figure out how to implement this yet.

Thank you for your time everyone!
- Finn

Finn LeSueur
1 Answer


First, your algorithm loops m times over the whole list of 200,000 words, where m is the number of distinct words in the list. That is really not a good way to count word occurrences and select the maximum. I could show you a more efficient algorithm (one that iterates over the list only once), but Python already has tools to do what you want.
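For illustration, the single-pass idea mentioned above could be sketched with a plain dictionary (this is a hypothetical sketch, not the approach recommended below):

```python
def max_word_frequency_dict(words):
    """Count each word in one pass over the list, then return the
    highest count (0 for an empty list)."""
    counts = {}
    for word in words:
        # dict.get avoids a separate membership test before incrementing
        counts[word] = counts.get(word, 0) + 1
    return max(counts.values()) if counts else 0
```

This walks the 200,000-word list once, then scans only the m distinct counts to find the maximum, instead of re-scanning the whole list for every distinct word.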

To solve your problem in a few lines of code, you can use an algorithm from the standard library, which is implemented in C and should be much more efficient than your loop. The Counter class with its most_common method is what you want:

>>> from collections import Counter
>>> counts = Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo'])
>>> counts
Counter({'foo': 3, 'abc': 2, 'bar': 1, 'def': 1})
>>> Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo']).most_common(1)
[('foo', 3)]

Then you just have to return the second element of the tuple (there is only one tuple here, since we asked for one with the 1 argument to most_common).

Performance comparison

Just to compare, I took a sample LaTeX file (~12 KB), split it into words on spaces (giving a list x of 1,835 words), and ran your function and the one below with timeit. You can see a real gain.

>>> len(x)
1835
>>> def max_word_2(words):
...     counts = Counter(words)
...     return counts.most_common(1)[0][1]
>>> timeit.timeit("max_word_2(x)", setup="from __main__ import x, max_word_2", number=1000)
1.1040630340576172
>>> timeit.timeit("max_word_frequency(x)", setup="from __main__ import x, max_word_frequency", number=1000)
35.623037815093994

Just this change might be sufficient to speed up your process :)
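As a side note, since you only need the count and not the word itself, an equivalent variant (a sketch, assuming Python 3.4+ for the `default` keyword of `max`) is to take the maximum of the counter's values directly, skipping the sort that most_common performs:

```python
from collections import Counter

def max_word_3(words):
    # Counter builds all counts in one pass over the list;
    # max then scans only the m distinct counts.
    # default=0 covers the empty-list case.
    return max(Counter(words).values(), default=0)
```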

Maxime Lorant
    Using Counter was exactly the kind of thing I was looking for! Thank you very much, this was perfect and very helpful! It went from taking minutes and minutes for 200,000+ words to a few seconds. Definitely gonna remember this trick. – Finn LeSueur May 29 '14 at 10:34
  • What matters most here is not that "it is implemented in C" - it is that by counting all words at once, the algorithm walks the list ONCE - while the OP's version walks the whole list once for each word in his set. – jsbueno May 29 '14 at 11:21
  • @jsbueno Yes indeed, I didn't really look at Finn's algorithm in detail (since I already knew I would answer with `Counter`), but I have added a paragraph about it. – Maxime Lorant May 29 '14 at 11:52