22

I'm getting a memory issue I can't seem to understand.

I'm on a Windows 7 64-bit machine with 8GB of memory, running a 32-bit Python program.

The program reads 5,118 zipped numpy files (npz). Windows reports that the files take up 1.98 GB on disk.

Each npz file contains two pieces of data: 'arr_0' is of type np.float32 and 'arr_1' is of type np.uint8.

The python script reads each file, appends its data to two lists, and then closes the file.
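For concreteness, here's a simplified, self-contained version of that loading loop. The real script just calls `np.load` on each of the 5,118 files; the stand-in file creation below is only there so the sketch runs on its own:

```python
import os
import tempfile

import numpy as np

# Stand-in files: a few tiny .npz archives with the same layout as the
# real ones ('arr_0' is float32, 'arr_1' is uint8).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, "sample_%d.npz" % i)
    np.savez(path,
             arr_0=np.zeros(4, dtype=np.float32),
             arr_1=np.zeros(8, dtype=np.uint8))
    paths.append(path)

float_chunks = []
byte_chunks = []
for path in paths:
    with np.load(path) as data:   # the context manager closes the file
        float_chunks.append(data['arr_0'])
        byte_chunks.append(data['arr_1'])

print(len(float_chunks), len(byte_chunks))  # 3 3
```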

Around file 4284/5118 the program throws a MemoryError.

However, the task manager says that the memory usage of python.exe *32 when the error occurs is 1,854,848K ~= 1.8GB, much less than my 8 GB of RAM or the supposed 4GB limit of a 32-bit program.

In the program I catch the memory error and it reports: each list has length 4285. The first list contains a total of 1,928,588,480 float32's ~= 229.9 MB of data. The second list contains 12,342,966,272 uint8's ~= 1,471.3 MB of data.

So, everything seems to check out, except for the part where I get a memory error. I absolutely have more memory, and the file it crashes on is ~800KB, so it's not failing on reading a huge file.

Also, the file isn't corrupted. I can read it just fine if I don't use up all that memory beforehand.

To make things more confusing, all of this seems to work fine on my Linux machine (although it does have 16GB of memory as opposed to 8GB on my Windows machine), but still, it doesn't seem to be the machine's RAM that is causing this issue.

Why is Python throwing a memory error, when I expect that it should be able to allocate another 2GB of data?

Saullo G. P. Castro
Erotemic
  • The amount of physical RAM you have is irrelevant. On Windows, you've always got swap, whether you want it or not. – abarnert Aug 16 '13 at 22:14
  • When this works on your linux machine… is that with a 32-bit Python as well? – abarnert Aug 16 '13 at 22:14
  • 1
    could you post the code you are using to load the `.npz` file? if you use `np.load(file, mmap_mode='r+')` it will use much less memory, since with this argument you will open a [`memory-mapped array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html)... – Saullo G. P. Castro Aug 16 '13 at 22:15
  • Is there a reason you're using a 32-bit Python interpreter to process gigabytes of data? You're just making things harder for yourself, and unless you have some good reason, why do that? – abarnert Aug 16 '13 at 22:33
  • @abarnert Because installing 64-bit python and its dependencies on windows is a pain? – J. Martinot-Lagarde Aug 19 '13 at 14:17
  • @J.Martinot-Lagarde: I just downloaded the "Python Windows X86_64 MSI Installer" from python.org and ran it, and it works fine. Christoph Gohlke's [unofficial Windows binaries](http://www.lfd.uci.edu/~gohlke/pythonlibs/) has the same set of packages for both 32- and 64-bit. Setting up to build C extensions locally is a pain, but it has the same dependencies and the same set of steps either way. What part are you having problems with? – abarnert Aug 19 '13 at 19:19
  • 1
    The problem is not with python itself but with numpy and scipy, which need a 64-bit fortran compiler. The only existing one is from Intel if I remember correctly, and is not free. I know that you can use [WinPython](https://code.google.com/p/winpython/) so it is possible. Still, you have to use unofficial binaries from a website not affiliated with python.org. – J. Martinot-Lagarde Aug 20 '13 at 07:08
  • @abarnert yes, Linux is 32-bit as well. The reason I'm using 32-bit python is that it is more stable than 64-bit, and the project I'm working on needs to work on older machines (albeit without that much memory). Also, python makes things easy much more often than it makes things difficult. – Erotemic Aug 21 '13 at 02:47
  • @Saullo Castro Thanks, I'll make sure I do that, regardless of whether it solves this problem. I was just using `np.load(file)`. – Erotemic Aug 21 '13 at 02:48
  • @J.Martinot-Lagarde: How long has it been since you checked that? `gfortran` builds for MinGW64 just as easily as for MinGW, and has for a while now. Or you can download binaries from [the exact same page at gcc.gnu.org](http://gcc.gnu.org/wiki/GFortranBinaries) as the MinGW32-for-Win64 binaries. – abarnert Aug 21 '13 at 19:23
  • 2
    @Erotemic: What makes you think 32-bit Python is more stable? Most of the core devs are on 64-bit Unix boxes nowadays. There have been many bugs and performance issues where a change made 32-bit worse and nobody noticed for months, and very few in the other direction. I can understand having build/toolchain problems, but if your reason really is thinking that Python itself is unstable in 64 bits, you're very wrong. – abarnert Aug 21 '13 at 19:25
  • @abarnert: the post is from august 2012: http://spyder-ide.blogspot.fr/2012/08/scientific-python-distribution-for.html – J. Martinot-Lagarde Aug 22 '13 at 07:09
  • 1
    @J.Martinot-Lagarde: Over a year ago, there were two problems. First, gfortran wasn't ready for prime time for MinGW64, but that isn't true today; it's now one of the fully-supported platforms, just like MinGW32 (and native Win32, cygwin, and all the *nixes). The other part of it—that Christoph Gohkle's Win64 binary scipy package may have had some unclear licensing that caused problems for anyone who wanted to distribute binary packages that used them—isn't relevant for many people, especially one who doesn't want to use "unofficial binaries from a website not affiliated to python.org". – abarnert Aug 22 '13 at 17:49
  • @abarnert Well, I guess that is news to me then. I'll look into switching to 64bit python. – Erotemic Aug 22 '13 at 21:14

1 Answer

40

I don't know why you think your process should be able to access 4GB. According to Memory Limits for Windows Releases at MSDN, on 64-bit Windows 7, a default 32-bit process gets 2GB.* Which is exactly where it's running out.

So, is there a way around this?

Well, you could make a custom build of 32-bit Python that uses the IMAGE_FILE_LARGE_ADDRESS_AWARE flag, and rebuild numpy and all of your other extension modules. I can't promise that all of the relevant code really is safe to run with the large-address-aware flag; there's a good chance it is, but unless someone's already done it and tested it, "a good chance" is the best anyone is likely to know.
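(As a sketch, you can at least check whether a given exe already has that flag set by reading the Characteristics field of its PE/COFF header; the `0x0020` constant is `IMAGE_FILE_LARGE_ADDRESS_AWARE` from the PE format spec, and `is_large_address_aware` is just an illustrative helper name:)

```python
import struct

IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020

def is_large_address_aware(exe_path):
    """Return True if the PE file at exe_path has the LAA flag set."""
    with open(exe_path, 'rb') as f:
        f.seek(0x3C)                    # e_lfanew: offset of the PE header
        pe_offset = struct.unpack('<I', f.read(4))[0]
        # COFF header: 4-byte "PE\0\0" signature, then 18 bytes of fields
        # before the 2-byte Characteristics field.
        f.seek(pe_offset + 4 + 18)
        characteristics = struct.unpack('<H', f.read(2))[0]
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)
```

On Windows you could point this at `sys.executable` to see how your interpreter was built.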

Or, more obviously, just use 64-bit Python instead.
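(If you're not sure which one you're running, the pointer size tells you:)

```python
import struct
import sys

# 32 on a 32-bit build, 64 on a 64-bit build.
print(struct.calcsize("P") * 8)

# Equivalent check: sys.maxsize is 2**31 - 1 on 32-bit builds.
print(sys.maxsize > 2**32)
```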


The amount of physical RAM is completely irrelevant. You seem to think that you have an "8GB limit" with 8GB of RAM, but that's not how it works. Your system takes all of your RAM plus whatever swap space it needs and divides it up between apps; an app may be able to get 20GB of virtual memory without getting a memory error even on an 8GB machine. And meanwhile, a 32-bit app has no way of accessing more than 4GB, and the OS will use up some of that address space (half of it by default, on Windows), so you can only get 2GB even on an 8GB machine that's not running anything else. (Not that it's possible to ever be "not running anything else" on a modern OS, but you know what I mean.)


So, why does this work on your linux box?

Because your linux box is configured to give 32-bit processes 3.5GB of virtual address space, or 3.99GB, or… Well, I can't tell you the exact number, but every distro I've seen for many years has been configured for at least 3.25GB.


* Also note that you don't even really get that full 2GB for your data; your program has to fit into that space too. Most of what the OS and its drivers make accessible to your code sits in the other half, but some bits sit in your half, along with every DLL you load and any space they need, and various other things. It doesn't add up to too much, but it's not zero.

abarnert
  • You actually don't have to compile the exe on windows, the `IMAGE_FILE_LARGE_ADDRESS_AWARE` is just a flag in the image header (not that this would ever be officially supported but hey we aren't judging ;)). Also dlls have no say in the matter to begin with, so those don't have to be changed anyhow. – Voo Aug 16 '13 at 22:54
  • @Voo: But all of your code, including your DLLs, has to be safe to _use_ with the flag on. If, say, Python and its standard extension modules check at build time whether you want large-address-aware support and generate different code in different cases, you would need to rebuild everything, not just the exe. If they're _always_ large-address-safe, then you don't need to do anything. And if they're _never_ large-address-safe, then rebuilding won't help. I don't know of any documentation that tells you which of the three it is… – abarnert Aug 17 '13 at 00:59
  • True, although the only reason that code will fail with IMAGE_FILE_LARGE_ADDRESS_AWARE is if it is broken to begin with (signed pointer math) or does stupid tricks with the high order bit of pointers. I'm very surprised that python does this stuff - where exactly in the code? (GC I assume, it's pretty much the only reason where this may be useful) Would love to look at that. – Voo Aug 17 '13 at 14:28
  • @Voo: I have absolutely no idea if Python or any Python modules that the OP depends on do such a thing. I don't think it's likely, but I can't guarantee it. There's obviously some reason it's not built with `IMAGE_FILE_LARGE_ADDRESS_AWARE` out of the box; my guess is that the reason is that so far none of the devs has ever found it worth testing and/or scrubbing the source, because if they really need more than 2GB they just use a 64-bit build. But that's just a guess, which is why my answer said there's a good chance it will work but I can't promise it. – abarnert Aug 19 '13 at 19:22
  • Yeah but if there's no compiler switch for this that python depends on then going through all the work of rebuilding python will do exactly nothing compared to just changing a single bit in the header which was my point. And I really can't see how python could use anything that relies on the high order bit being unused - after all *nix generally doesn't give such guarantees and python runs fine there. – Voo Aug 19 '13 at 19:33
  • @Voo: I don't _know_ if there's a compiler switch for this that Python depends on (possibly even a Python-specific /D flag that's #if'd in the code). If you build it, you should be able to tell by looking at… I forget the Windows equivalent of `python-config --cflags`. Otherwise, all you can do is guess. I've said repeatedly that it's not _likely_ to be a problem, but I can't _promise_ that. So I don't know what you're trying to argue. Saying "I know you said it's not likely, but it's not likely" doesn't add anything. If you can _prove_ it's safe, great; if not, what else is there to say? – abarnert Aug 19 '13 at 19:46