-2
def get_word_frequencys(words):

    """given a list of words, returns a dictionary of the words,
    and their frequencys"""

    words_and_freqs = {}
    for word in words:
        words_and_freqs[word] = words.count(word)
    return words_and_freqs

The above function works fine for small files, however, I need it to work on a file 264505 words long, currently, my program takes several minutes for files this size.

How can I construct a dictionary in a more efficient way?

all relevent code:

def main(words):
    """
    given lots of words do things
    """
    words_and_frequencys = get_word_frequencys(words)

    print("loaded ok.")
    print()
    print_max_frequency(words, words_and_frequencys)


def get_word_frequencys(words):
    """given a list of words, returns a dictionary of the words,
    and their frequencys"""
    words_and_freqs = {}
    for word in words:
        words_and_freqs[word] = words.count(word)
    return words_and_freqs      


def print_max_frequency(words, words_and_frequencys):
    """given a dict of words and their frequencys,
    prints the max frequency of any one word"""
    max_frequency = 0
    for word in words:
        if words_and_frequencys.get(word) > max_frequency:
            max_frequency = words_and_frequencys.get(word)
    print(" " + "Maximum frequency = {}".format(max_frequency)) 

note for those suggesting Counter instead of Count(), I'm not allowed to import any modules apart from os and re.

James Lee
  • 3
  • 3

1 Answers1

0

Each time you call count on your list, you iterate over the whole thing (taking O(N) time). Since you do that for every word in the list, your whole operation takes O(N**2) time. You can do much better.

Instead of counting how many times a word you've just seen appears elsewhere in the list, why not just count the one occurrence of it that you've seen in your iteration? You can update the count if you see more copies later on. Since this only does a small amount of work for each word, the total running time will be linear instead of quadratic.

for word in words:
    words_and_freqs[word] = words_and_freqs.get(word, 0) + 1

If you don't like using dict.get, you can instead use an explicit if statement to check if the current word has been seen before:

for word in words:
    if word in words_and_freqs:
        words_and_freqs[word] += 1
    else:
        words_and_freqs[word] = 1
Blckknght
  • 100,903
  • 11
  • 120
  • 169
  • If this anwers your question why don't you accept it as answer @JamesLee so others after you can find it easily? – Rarblack Oct 21 '18 at 12:19