import codecs
import numpy as np

# Read the corpus, lowercase it, and strip punctuation.
with codecs.open("lith.txt", encoding='utf-8') as f:
    text = f.read().lower()
for ch in '"?,!.':
    text = text.replace(ch, '')
text = text.split()
# Vocabulary and a word -> integer index mapping.
words = sorted(set(text))
Lexicon = dict(zip(words, range(len(words))))
# Dense count arrays; the last line is the one that fails for a large vocabulary.
Unigram = np.zeros(len(words))
Bigram = np.zeros([len(words), len(words)])

I keep running into major issues with the last line of this portion of the program. The text file is about 7,000,000 words long, and the resulting vocabulary (`len(words)`) is about 200,000. When I cut the text file down to the point where the vocabulary is around 40,000 words, the program works. Is there any way to get around this memory limitation? Thanks for any help. The results in later parts of the program really seem to suffer if I just keep cutting out portions of the text until the memory errors go away.

# Count unigrams and bigrams in a single pass over the corpus.
for n in range(len(text) - 1):
    i, j = Lexicon[text[n]], Lexicon[text[n + 1]]
    Unigram[i] += 1
    Bigram[i, j] += 1
Unigram[Lexicon[text[-1]]] += 1  # the loop stops one word before the end

# Indices of the 5,000 most frequent words, most frequent first.
Unigram_sorted = np.argsort(Unigram)[::-1][:5000]

1 Answer


I assume that the line that raises the exception is:

Bigram = np.zeros([len(words),len(words)])

If `len(words)` is 200,000, then the matrix has 200,000² entries. At 8 bytes per entry (NumPy's default dtype is float64; int64 is the same size), that comes to about 320 GB of memory.
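A quick back-of-the-envelope check:

n_words = 200000
bytes_per_entry = 8  # np.zeros allocates float64 by default
print(n_words ** 2 * bytes_per_entry)  # 320000000000 bytes, i.e. 320 GB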

Assuming most entries will remain zero, sparse matrices could be helpful, for example SciPy's sparse matrices. For counting joint pairs, this snippet could be of help:

from scipy.sparse import dok_matrix

# DOK ("dictionary of keys") stores only the nonzero entries.
Bigrams = dok_matrix((len(words), len(words)))
# Bigrams[i, j] += 1
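
As a minimal sketch, the counting loop from the question could fill it like this (assuming `text` and `Lexicon` are built as in the question):

Bigrams = dok_matrix((len(words), len(words)), dtype=int)
for n in range(len(text) - 1):
    Bigrams[Lexicon[text[n]], Lexicon[text[n + 1]]] += 1

Per-element updates on a DOK matrix are not especially fast over 7,000,000 tokens; counting the (i, j) pairs in a `collections.Counter` first and building a `coo_matrix` from them in one shot is a common speed-up.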

Regarding the code itself, the first part has a relatively close analogue in scikit-learn's text vectorizers.
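
For instance, a rough sketch with `CountVectorizer` (it does its own tokenization and lowercasing, so the resulting vocabulary may differ slightly from the manual preprocessing above; `raw_text` here stands for the file contents):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform([raw_text])  # 1 x vocabulary sparse count matrix
vocab = vectorizer.vocabulary_                 # word -> column index, like Lexicon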

Elisha
  • That is indeed the line that is causing problems; it works when the length of words is much smaller. I was looking at sparse matrices, but I wasn't sure how to use them. Can you offer any advice on how I might set one up? – zettasyntax19 May 02 '17 at 13:26
  • Sure, can you elaborate on what you need to do with the matrix? – Elisha May 02 '17 at 13:27
  • The next portion of the program basically calculates the bigram frequency of each word in `words`, and then I save only the 5,000 most frequent rows and columns, so I really don't end up using most of that massive matrix in the rest of the program. – zettasyntax19 May 02 '17 at 13:35
  • I've edited and added an example of `dok_matrix()`. Depending on how it is used after construction, it can be converted to a representation that is more efficient for slicing, such as `csr_matrix` or `csc_matrix`. – Elisha May 02 '17 at 13:41
  • I have another possibly silly question. When I try to iterate through the matrix, I get errors like "out of bounds", so I guess I'm really not understanding how a sparse matrix works. – zettasyntax19 May 02 '17 at 14:13
  • It depends on the representation you choose; in general, this is a good guideline: http://stackoverflow.com/questions/4319014/iterating-through-a-scipy-sparse-vector-or-matrix (see also the sketch after these comments). – Elisha May 02 '17 at 14:58
  • Your question about bounds in a sparse matrix should be put in a new question. We need to know how you construct the matrix, and whether the bounds issue arises during construction or during some later use. The learning curve for sparse matrices is not trivial. – hpaulj May 02 '17 at 16:03
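
A minimal sketch of the two points raised in the comments, assuming `Bigrams` was filled as in the answer: converting to COO exposes parallel `row`, `col`, and `data` arrays for iterating over only the stored entries, and converting to CSR makes the top-5,000 row/column slice efficient.

coo = Bigrams.tocoo()
for i, j, count in zip(coo.row, coo.col, coo.data):
    pass  # each stored (i, j) bigram with its count; nothing out of bounds

csr = Bigrams.tocsr()
top = Unigram_sorted                       # the 5,000 most frequent word indices
Bigram_small = csr[top][:, top].toarray()  # 5,000 x 5,000 dense, about 200 MB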