
I'm trying to use this lda package to process a term-document matrix CSV file with 39568 rows and 27519 columns containing only non-negative integer counts.

Problem: I get a MemoryError with my approach of reading the file and storing it in a numpy array.

Goal: Read the numbers from the TDM CSV file into a numpy array that I can pass to lda.

with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
    matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])
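
A rough back-of-the-envelope estimate (my own arithmetic) shows why this runs out of memory, especially on a 32-bit Python:

rows, cols = 39568, 27519
cells = rows * cols                # 1,088,871,792 cells
print(cells * 4 / float(2 ** 30))  # ~4.06 GiB as a dense int32 array
print(cells * 8 / float(2 ** 30))  # ~8.11 GiB as float64
# a nested list of Python int objects costs several times more than int32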

I've also tried using numpy's append, vstack and concatenate, and I still get the MemoryError.

Is there a way to avoid the MemoryError?

Edit:

I've tried using dtype int32 and int and it gives me:

WindowsError: [Error 8] Not enough storage is available to process this command

I've also tried using dtype float64 and it gives me:

OverflowError: cannot fit 'long' into an index-sized integer

Using these two snippets:

fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+',
               shape=(len(documents), len(vocabulary)))
# genfromtxt still builds the full array in memory before the copy
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',',
                       skip_header=1)
fp[:] = matrix[:]

and

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = [value for value in tdm_file.readline().strip().split(',')]
    fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
    for idx, line in enumerate(tdm_file):
        fp[idx] = np.array(line.strip().split(','))

Other info that might matter

  • Win10 64bit
  • 8GB RAM (7.9 usable); memory usage climbs from roughly 3GB (around 2GB of which was already in use) to a peak of about 5.5GB before the MemoryError is reported
  • Python 2.7.10 [MSC v.1500 32 bit (Intel)]
  • Using PyCharm Community Edition 5.0.3
  • Have you tried [numpy.loadtxt](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html)? – karlson Jan 03 '16 at 20:08
  • Separate the list comprehension (that makes a nested list of lists) from the `array` call. Which one produces the memory error? `loadtxt`, `genfromtxt` do essentially what you are doing - collecting values in a list and making the array at the end. – hpaulj Jan 03 '16 at 20:08
  • Depending on how many zeros are in your dataset, it may be useful to use a sparse matrix format to avoid memory errors. – Ryan Jan 03 '16 at 20:21
  • @karlson Yes, just now and I get the error from `...\numpy\lib\npyio.py, line 916, in loadtxt` which says `for i, line in enumerate(itertools.chain([first_line], fh)):` followed by the `MemoryError` – ZeferiniX Jan 03 '16 at 20:44
  • @hpaulj `for line in tdm_file` produces the MemoryError – ZeferiniX Jan 03 '16 at 20:45
  • @Ryan If I understood what the [textmining](http://pydoc.net/Python/textmining/1.0/textmining/) package says (which I used to generate the csv file), particularly on the `TermDocumentMatrix Class` inside the `__init__`, it's a sparse matrix, but I can't use this because the type is `` and the lda package needs a numpy array, so I decided to use the `write_csv` method of the `textmining` package and read the generated csv file for this purpose. Dunno if that's useful info to that end. – ZeferiniX Jan 03 '16 at 20:54
  • What dtype(s) will the final array contain? If you can't hold the entire .csv file in memory you can read sequential chunks of rows ([e.g. here](http://stackoverflow.com/a/34533601/1461210)), then write them to a (possibly [memory-mapped](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.memmap.html)) numpy array or an HDF5 file (a sketch of this approach appears just below these comments). – ali_m Jan 03 '16 at 22:42
  • So the error occurs while you are still collecting data in the list of lists. Converting each line into an array (but still collecting them in a list) might save some space, especially if you can use a smaller `dtype`. – hpaulj Jan 03 '16 at 23:16
  • @ali_m Not sure which one to use but I think the smallest one would be the best since the numeric values inside the file rarely have 3 digits. I tried using `dtype='int32'` and it gave me a `WindowsError`, tried also `dtype='float64'` and it gave me an `OverflowError: cannot fit 'long' into an index-sized integer`. Using these codes `fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))` followed by `matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)` and copied the values using `fp[:] = matrix[:]` – ZeferiniX Jan 04 '16 at 07:06
  • @ali_m my bad, done. When the shape is small, it works and I can use it with the lda package but when I use the actual values for my problem `shape=(39568, 27519)`, it gives me those errors with respect to the dtype used. – ZeferiniX Jan 04 '16 at 08:16
  • @hpaulj tried, still gave me an error. Check my edit. – ZeferiniX Jan 04 '16 at 08:24
  • You would be much better off exploiting the sparsity of your original `TermDocumentMatrix`. Based on the [documentation](http://pydoc.net/Python/textmining/1.0/textmining/) you linked to, a `TermDocumentMatrix` is essentially a list of `{word:count}` dicts. You could construct a [`scipy.sparse`](http://docs.scipy.org/doc/scipy/reference/sparse.html) matrix from this, then pass it directly to `lda.LDA.fit`. Saving the whole matrix to a CSV file is very inefficient in terms of storage space and read/write time. – ali_m Jan 04 '16 at 11:25
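
For reference, here is a minimal sketch of the row-streaming memmap approach suggested in the comments (shape, paths and dtype are my assumptions taken from the question). Note that a ~4 GiB int32 memmap cannot be addressed by a 32-bit Python process, which is consistent with the WindowsError and OverflowError above, so this sketch assumes a 64-bit interpreter:

import numpy as np

n_docs, n_words = 39568, 27519  # shape taken from the question

# back the full array with a file on disk instead of RAM
fp = np.memmap("Results/TDM-memmap.dat", dtype='int32', mode='w+',
               shape=(n_docs, n_words))

with open("Results/TDM.csv", 'r') as tdm_file:
    tdm_file.readline()  # skip the vocabulary header row
    for idx, line in enumerate(tdm_file):
        # parse and write one row at a time, so peak RAM is a single row
        fp[idx] = np.array(line.strip().split(','), dtype='int32')

fp.flush()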

1 Answer


Since your word counts will be almost all zeros, it would be much more efficient to store them in a `scipy.sparse` matrix. For example:

from scipy import sparse
import textmining
import lda

# a small example matrix
tdm = textmining.TermDocumentMatrix()
tdm.add_doc("here's a bunch of words in a sentence")
tdm.add_doc("here's some more words")
tdm.add_doc("and another sentence")
tdm.add_doc("have some more words")

# tdm.sparse is a list of dicts, where each dict contains {word:count} for a single
# document
ndocs = len(tdm.sparse)
nwords = len(tdm.doc_count)
words = tdm.doc_count.keys()

# initialize output sparse matrix
X = sparse.lil_matrix((ndocs, nwords), dtype=int)

# iterate over documents, fill in rows of X
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
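        # note: list.index does a linear scan over the vocabulary; for the
        # full-size matrix, precomputing a {word: column} dict would be faster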
        jj = words.index(word)
        X[ii, jj] = count

`X` is now an `(ndocs, nwords)` `scipy.sparse.lil_matrix`, and `words` is a list corresponding to the columns of `X`:

print(words)
# ['a', 'and', 'another', 'sentence', 'have', 'of', 'some', 'here', 's', 'words', 'in', 'more', 'bunch']

print(X.todense())
# [[2 0 0 1 0 1 0 1 1 1 1 0 1]
#  [0 0 0 0 0 0 1 1 1 1 0 1 0]
#  [0 1 1 1 0 0 0 0 0 0 0 0 0]
#  [0 0 0 0 1 0 1 0 0 1 0 1 0]]

You could pass `X` directly to `lda.LDA.fit`, although it will probably be faster to convert it to a `scipy.sparse.csr_matrix` first:

X = X.tocsr()
model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
model.fit(X)
# INFO:lda:n_documents: 4
# INFO:lda:vocab_size: 13
# INFO:lda:n_words: 21
# INFO:lda:n_topics: 2
# INFO:lda:n_iter: 100
# INFO:lda:<0> log likelihood: -126
# INFO:lda:<10> log likelihood: -102
# INFO:lda:<20> log likelihood: -99
# INFO:lda:<30> log likelihood: -97
# INFO:lda:<40> log likelihood: -100
# INFO:lda:<50> log likelihood: -100
# INFO:lda:<60> log likelihood: -104
# INFO:lda:<70> log likelihood: -108
# INFO:lda:<80> log likelihood: -98
# INFO:lda:<90> log likelihood: -98
# INFO:lda:<99> log likelihood: -99
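
As a possible follow-up (my sketch, not part of the original answer), the fitted model exposes a `topic_word_` array that can be used to peek at the learned topics:

import numpy as np

# topic_word_ has shape (n_topics, vocab_size)
for k, dist in enumerate(model.topic_word_):
    top_words = np.array(words)[np.argsort(dist)][:-4:-1]  # 3 most probable
    print('topic {}: {}'.format(k, ' '.join(top_words)))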
  • Took me a while to install SciPy and use it on PyCharm. Ended up using the SciPy from [Unofficial Windows Binaries for Python Extension Packages](http://www.lfd.uci.edu/~gohlke/pythonlibs/). Tried the code above with my data, it's working and much faster! Thank you for the quick guide on converting the `tdm` to a `scipy sparse matrix` and for your time! – ZeferiniX Jan 04 '16 at 16:48