
In relation to my other question here, this code works if I use a small chunk of my dataset with dtype='int32'. Using float64 produces a TypeError on my main process after this portion because of safe casting rules, so I'll stick with int32. Nonetheless, I'm curious and want to understand the errors I'm getting.

import numpy as np

fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)
fp[:] = matrix[:]

If I use the full data (where shape=(329568, 27519)), with these dtypes:

I get a WindowsError when I use int32 or int

and

I get an OverflowError when I use float64

Why does this happen, and how can I fix it?

Edit: Added Tracebacks

Traceback for int32

Traceback (most recent call last):
  File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
    fp = np.memmap("E:/TDM-memmap.txt", dtype='int32', mode='w+', shape=(len(documents), len(vocabulary)))
  File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
WindowsError: [Error 8] Not enough storage is available to process this command

Traceback for float64

Traceback (most recent call last):
  File "C:/Users/zeferinix/PycharmProjects/Projects/NLP Scripts/NEW/LDA_Experimental1.py", line 123, in <module>
    fp = np.memmap("E:/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
  File "C:\Python27\lib\site-packages\numpy\core\memmap.py", line 260, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OverflowError: cannot fit 'long' into an index-sized integer

Edit: Added other info

Other info that might help: I have a 1TB (931GB usable) HDD with 2 partitions: Drive D (22.8GB free of 150GB), where my work files live, including this script, and Drive E (406GB free of 781GB), where my torrent stuff goes. At first I tried to write the mmap file to Drive D, and it generated a 1,903,283KB file for int32 and a 3,806,566KB file for float64. I thought it might be running out of space, which was why I got those errors, so I tried Drive E, which should be more than enough, but it generated the same file sizes and gave the same errors.

  • You won't be able to read that file in one go using `np.genfromtxt` - the resulting array will take up ~36GB of RAM using int32, and double that for int or float64. The point of using a memory-mapped array here is that it allows you to read the file in smaller chunks, then write each chunk to the memory-mapped array so that you don't have to hold the whole thing in memory at once. Take another look at my answer [here](http://stackoverflow.com/a/34533601/1461210) to see how this might work. – ali_m Jan 04 '16 at 11:15
  • @ali_m added tracebacks for int32 and float64, both report `memmap.py` – ZeferiniX Jan 04 '16 at 13:29
  • @ali_m done, I'm using the 32bit version of numpy. – ZeferiniX Jan 04 '16 at 13:49
  • @ali_m Sorry for the ambiguity, yes, I'm referring to the file generated by `np.memmap`. I added a `.txt` extension when I created the file, so I mistook it for a txt file, my bad. – ZeferiniX Jan 04 '16 at 13:53

1 Answer


I don't think it is possible to generate an np.memmap file that large using a 32 bit build of numpy, regardless of how much disk space you have.

The error occurs when np.memmap tries to call mmap.mmap internally. The second argument to mmap.mmap specifies the length of the file in bytes. For a 329568 by 27519 array containing 64 bit (8 byte) values, the length will be 72555054336 bytes (i.e. ~72GB).
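You can sanity-check those numbers yourself, e.g.:

import numpy as np

n_rows, n_cols = 329568, 27519
print(n_rows * n_cols * np.dtype('float64').itemsize)   # 72555054336 (~72GB)
print(n_rows * n_cols * np.dtype('int32').itemsize)     # 36277527168 (~36GB)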

The value 72555054336 needs to be converted to an integer type that can be used as an index. In 32 bit Python, indices need to be 32 bit integer values. However, the largest number that can be represented by a 32 bit integer is much smaller than 72555054336:

print(np.iinfo(np.int32(1)).max)
# 2147483647

Even an int32 array would require a length of 36277527168 bytes, which is still about 16x larger than the largest representable 32 bit integer.

I don't see any easy way around this problem besides switching to 64 bit Python/numpy. There are other very good reasons to do this - a 32 bit Python process can only address 2-4GB of RAM (depending on the OS), even though your machine has 8GB available.
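If you're not sure which build you're running, here's a quick check using only the standard library:

import struct, sys

print(struct.calcsize("P") * 8)   # size of a pointer in bits: 32 or 64
print(sys.maxsize > 2**32)        # True only on a 64 bit build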


Even if you could generate an np.memmap that big, the next line

matrix = np.genfromtxt("Results/TDM-short.csv", dtype='int32', delimiter=',', skip_header=1)

will definitely fail, since it requires creating an array in RAM that's ~36GB in size. The only way that you could possibly read that CSV file is in smaller chunks, like in my answer here that I linked to in the comments above.
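Roughly, the chunked approach might look like the sketch below. The file paths and chunk size are placeholders for your setup, it assumes the CSV holds exactly len(documents) data rows after its header, and it still requires 64 bit Python/numpy so the memmap itself can be addressed:

import numpy as np
from itertools import islice

n_rows, n_cols = 329568, 27519   # len(documents), len(vocabulary)
chunk_rows = 1000                # tune to the RAM you can spare

fp = np.memmap("E:/TDM-memmap.dat", dtype='int32', mode='w+',
               shape=(n_rows, n_cols))

with open("Results/TDM.csv") as f:
    next(f)                      # skip the header row
    row = 0
    while row < n_rows:
        # np.genfromtxt accepts any iterable of lines, so islice lets us
        # parse chunk_rows lines at a time instead of the whole file
        chunk = np.atleast_2d(np.genfromtxt(islice(f, chunk_rows),
                                            dtype='int32', delimiter=','))
        fp[row:row + chunk.shape[0]] = chunk
        row += chunk.shape[0]

fp.flush()                       # make sure everything is written to disk

This way only chunk_rows rows of the CSV ever have to sit in RAM at once.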

As I mentioned in the comments for your other question, what you ought to do is convert your TermDocumentMatrix to a scipy.sparse matrix rather than writing it to a CSV file. This would require much, much less storage space and RAM, since it can take advantage of the fact that almost all of the word counts are zero-valued.
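For example, a minimal sketch (the counts triples here are hypothetical stand-ins for however you walk the nonzero entries of your TermDocumentMatrix, and scipy.sparse.save_npz needs scipy >= 0.19):

import numpy as np
from scipy import sparse

# Hypothetical (doc_index, term_index, count) triples for the nonzero entries
counts = [(0, 5, 2), (0, 17, 1), (3, 5, 4)]
n_docs, n_terms = 329568, 27519

rows, cols, vals = zip(*counts)
tdm = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(n_docs, n_terms), dtype=np.int32).tocsr()

# Only the nonzero entries are stored - a tiny fraction of the ~36GB
# the dense int32 array would need
sparse.save_npz("Results/TDM-sparse.npz", tdm)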

  • I see, makes sense. Thank you for the precise calculations and explanation! I'll probably give up on this approach since it's inefficient. I didn't know any other way to work on this before I asked; I'll give scipy.sparse a shot first and switch to 64bit numpy if that fails, thanks again! – ZeferiniX Jan 04 '16 at 14:33
  • It's well worth learning how to do these sorts of calculations, since they can quickly tell you whether or not what you're trying to do is feasible. – ali_m Jan 04 '16 at 14:56