import codecs
import numpy as np

# Read the corpus, lowercase it, and strip basic punctuation
with codecs.open("lith.txt", encoding='utf-8') as f:
    text = f.read()
text = text.lower().replace('"', '').replace('?', '').replace(',', '').replace('!', '').replace('.', '')
text = text.split()
# Vocabulary and word -> index lookup
words = sorted(set(text))
Lexicon = dict(zip(words, range(len(words))))
# Count arrays: Unigram is 1-D, Bigram is vocabulary x vocabulary (this allocation is what fails)
Unigram = np.zeros(len(words))
Bigram = np.zeros([len(words), len(words)])
I keep running into memory errors on the last line of this portion of the program (the Bigram allocation). The text file is roughly 7,000,000 words long, and the vocabulary, len(words), comes out to about 200,000. When I cut the text file down to the point where the vocabulary is around 40,000 words, the program runs. Is there any way to get around this memory limitation? The results I get in later parts of the program really seem to suffer if I just keep cutting out portions of the text until the memory errors go away. Thanks for any help.
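If I have the arithmetic right, the dense Bigram array alone is what blows up, since np.zeros defaults to float64 (8 bytes per entry):

# Rough footprint of the dense Bigram array (float64, 8 bytes per entry)
vocab = 200000
print(vocab ** 2 * 8 / 1e9)     # ~320 GB for the full corpus
print(40000 ** 2 * 8 / 1e9)     # ~12.8 GB for the cut-down corpus, which evidently still fits

Here is the rest of the code that fills and uses these counts: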
# Accumulate unigram counts and adjacent-pair (bigram) counts
for n in range(len(text) - 1):
    Unigram[Lexicon[text[n]]] += 1
    Bigram[Lexicon[text[n]], Lexicon[text[n + 1]]] += 1
# Word indices sorted by descending unigram frequency, keeping the top 4,999
Unigram_sorted = np.argsort(Unigram)[::-1]
Unigram_sorted = Unigram_sorted[0:4999]
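For what it is worth, here is a minimal sketch of the kind of sparse counting I suspect would avoid the dense allocation, using collections.Counter; it only stores word pairs that actually occur, so whether the later parts of the program can be adapted to work from counts like these is exactly what I am unsure about:

from collections import Counter

# Count only the adjacent word pairs that actually occur; memory then scales
# with the number of distinct pairs, not with the square of the vocabulary.
unigram_counts = Counter(text)
bigram_counts = Counter(zip(text, text[1:]))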