
I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file along with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the txt file, finds the word pairs, and counts how many times each pair was seen in the corpus. I have managed to write the code below:

tuples = [[]] #array for word tuples to be stored in
data = []   #array for tuple frequencies to be stored in

data.append(0) #the tuples array starts with an empty element (it was initialized as [[]]),
               #so adding a zero to the beginning of the frequency array keeps the indexes of the two arrays aligned

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split() #split line to array of words
        tupleIndex = 0
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1): #do this for every word except the last one since the last word has no word after it.
            if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples: #if the word pair was seen before
                data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1 #increment the frequency of said pair
            else:
                tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]]) #if the word pair was never seen before
                data.append(1)                                                        #add the pair to list and set frequency to 1.

        #print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)

with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    #write tuples to txt file
    for pair in tuples:
        if (len(pair) > 0): # if pair is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")

    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")
    #blank spaces between two data

    #write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")

    markovWindowSize1File.write("\n")
    markovWindowSize1File.write("\n")

This code seems to work well for the first couple of thousand lines. Then things start to slow down, because the tuple list keeps growing and I have to search the whole list to check whether the next word pair has been seen before. I managed to process 50k lines in 30 minutes, but I have much bigger corpuses with millions of lines. Is there a way to store and search for the word pairs more efficiently? A matrix would probably be a lot faster, but my unique word count is about 300,000 words, which means I would have to create a 300k*300k matrix with integers as the data type. Even after taking advantage of symmetric matrices, it would require a lot more memory than I have.

I tried using memmap from numpy to store the matrix on disk rather than in memory, but it required about 500 GB of free disk space (a 300,000 * 300,000 matrix has 9 * 10^10 cells, which already adds up to hundreds of gigabytes with 4-byte integers).

Then I studied sparse matrices and found out that I can store just the non-zero values together with their corresponding row and column numbers, which is essentially what I did in my code.
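For reference, a minimal sketch of that sparse idea using scipy's DOK (dictionary-of-keys) format is shown below. It assumes each line of markovData.txt already contains integer IDs and that the vocabulary has roughly 300,000 words; my final code further down uses a plain dict instead, so this is only an illustration:

import numpy as np
import scipy.sparse as sp

vocab_size = 300_000  # assumed vocabulary size
counts = sp.dok_matrix((vocab_size, vocab_size), dtype=np.int32)  # only non-zero cells are stored

with open("markovData.txt") as f:
    for line in f:
        ids = [int(tok) for tok in line.split()]
        for a, b in zip(ids, ids[1:]):  # adjacent word-ID pairs
            counts[a, b] += 1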

Right now the model works, but it is very bad at guessing the next word correctly (about an 8% success rate). I need to train on bigger corpuses to get better results. What can I do to make this word-pair counting code more efficient?

Thanks.


Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:

import numpy as np
import time

start = time.time()
myDict = {} # empty dict

with open("markovData.txt") as f:
    contentData = f.readlines()
    contentData = [x.strip() for x in contentData]
    lineIndex = 0
    for line in contentData:
        tmpArray = line.split() #split line to array of words
        tmpArrayIndex = 0
        for tmpArrayIndex in range(len(tmpArray) - 1): #do this for every word except the last one since the last word has no word after it
            if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict: #if the word pair was seen before
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  #increment the frequency of said pair
            else:
                myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1 #if the word pair was never seen before,
                                                                                 #add the pair to the dict with frequency 1

        #print every 1000th line to check the progress
        lineIndex += 1
        if ((lineIndex % 1000) == 0):
            print(lineIndex)


end = time.time()
print(end - start)

keyText= ""
valueText = ""

for key1,key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1,key2]) + " ")


with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)

with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
  • Instead of using a list of lists, you could try a flat dictionary where the keys are the appended word ids K;L. I assume that for the majority of words, they will not pair. Then instead of checking whether you've seen a word pair K,L before (that is the bottleneck of your algorithm right now) you can simply increase mydict[K;L] = mydict[K;L] + 1. – marktani May 25 '19 at 11:51
  • Also, you are writing to file again and again. Instead, join the string once and write it once to file. Refer to this answer: https://stackoverflow.com/a/27384379/1176596. This is actually the bigger bottleneck I believe. – marktani May 25 '19 at 11:54

2 Answers


As I understand you, you are trying to build a Hidden Markov Model, using frequencies of n-grams (word tuples of length n). Maybe just try out a more efficiently searchable data structure, for example a nested dictionary. It could be of the form

{ID_word1:{ID_word1:x1,... ID_wordk:y1}, ...ID_wordk:{ID_word1:xn, ...ID_wordk:yn}}.

This would mean that you have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. This should boost your performance, since you do not have to search a growing list of tuples. x and y stand for the occurrence counts, which you should increment when encountering a tuple. (Never use the built-in function count()!)
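A minimal sketch of that nested-dictionary idea (the file name follows the question, word IDs are kept as strings, and most_likely_next is just a hypothetical helper to show how you might use the counts):

from collections import defaultdict

counts = defaultdict(dict)  # counts[first_id][second_id] -> frequency

with open("markovData.txt") as f:
    for line in f:
        ids = line.split()
        for first, second in zip(ids, ids[1:]):  # adjacent word-ID pairs
            counts[first][second] = counts[first].get(second, 0) + 1

# predict the next word: pick the most frequent follower of a given ID
def most_likely_next(word_id):
    followers = counts.get(word_id)
    if not followers:
        return None
    return max(followers, key=followers.get)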

CLpragmatics

I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary, but it counts the occurrences of its keys. You could use it by simply incrementing a word pair as you encounter it:

from collections import Counter

word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs in the line
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1

Alternatively, you can build the list of word pairs as you already do and simply pass it into a Counter to compute the frequencies at the end:

word_counts = Counter(word_tuple_list)
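For example, with a hypothetical list of ID pairs built the same way as in the question:

from collections import Counter

word_tuple_list = [("12", "7"), ("7", "33"), ("12", "7")]  # hypothetical ID pairs
word_counts = Counter(word_tuple_list)
print(word_counts[("12", "7")])    # 2
print(word_counts.most_common(1))  # [(('12', '7'), 2)]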
peels