
I have a 3.374 GB npz file, myfile.npz.

I can read it in and view the filenames:

import numpy as np

a = np.load('myfile.npz')
a.files

gives

['arr_1','arr_0']

I can read in 'arr_1' OK:

a1 = a['arr_1']

However, I cannot load arr_0 or read its shape:

a1 = a['arr_0']
a['arr_0'].shape

Both of the above operations give the following error:

ValueError: array is too big

I have 16 GB of RAM, of which 8.370 GB is available, so the problem doesn't seem to be a shortage of memory. My questions are:

  1. Should I be able to read this file in?

  2. Can anyone explain this error?

  3. I have been looking at using np.memmap to get around this (see the sketch after this list) - is this a reasonable approach?

  4. What debugging approach should I use?
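
A minimal sketch of the kind of np.memmap approach I have in mind (the filename, dtype and shape below are placeholders - the data would first have to exist as a plain binary file on disk):

import numpy as np

# map a raw binary file on disk instead of loading it all into RAM
# ('arr_0_raw.dat', the dtype and the shape are hypothetical placeholders)
big = np.memmap('arr_0_raw.dat', dtype=np.float64, mode='r',
                shape=(200, 1440, 3, 13, 32))

# only the slice I actually need would be read into memory
subset = np.array(big[0, :, :, 9, :])  # shape (1440, 3, 32)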

EDIT:

I got access to a computer with more RAM (48 GB) and the file loaded. The dtype was in fact complex128, and the uncompressed size of a['arr_0'] was 5750784000 bytes. It seems that some RAM overhead is required beyond the array itself, or else my estimate of the available RAM was wrong (I used Windows Sysinternals RAMMap).
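
For reference, a quick sanity check of that figure (using the corrected shape from the comments and 16 bytes per complex128 element):

import numpy as np

shape = (200, 1440, 3, 13, 32)
n_elements = np.prod(shape, dtype=np.int64)            # 359424000 elements
print(n_elements * np.dtype(np.complex128).itemsize)   # 5750784000 bytes, ~5.36 GiB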

Lee
  • I suspect the reason you can't load it is because representing, for example, 3.4 as a float in the computer memory requires more memory than saving 3.4 on disk. But I'm less sure of that than I was before I started reading up on it. – Joel Feb 02 '15 at 11:16
  • Do you know if the file is compressed (was it created using [`np.savez_compressed()`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez_compressed.html#numpy.savez_compressed))? Did you create it on the same machine you are trying to load it on? Do you know what kind of arrays it contains (size and dtype)? – ali_m Feb 02 '15 at 11:46
  • @ali_m, yes was saved with `np.savez_compressed` but on a different machine. The `arr_0` is floats (I think 8 byte) of shape `(200,1440,3,12,32)`, `arr_1` is (200,3,32) again floats. – Lee Feb 02 '15 at 11:53
  • @atomh33ls Are you sure those dimensions are correct? If so, then `arr_0` should only be ~2.5GB in memory (assuming double floats). – ali_m Feb 02 '15 at 12:01
  • Also, I only need a portion of the array, `a['arr_0'][0,:,:,9,:]`, giving a shape of `(1440,3,32)` – Lee Feb 02 '15 at 12:02
  • @ali_m ah yes apologies - should be `(200,1440,3,13,32)` I think this gives ~ 3.32 GB for `arr_0` – Lee Feb 02 '15 at 12:04
  • @ali_m Note that I was wrong, the dtype was `complex128` (see revised question) – Lee Feb 02 '15 at 15:11
  • Try `mmap_mode='r'` as an additional argument for `np.load`. This should not load the arrays into memory but keep them on the disk, unless you're copying them afterwards into another array. – oschoudhury Feb 02 '15 at 15:37
  • @Wicket that won't work for `.npz` files containing multiple arrays (at least on my machine). When you actually try to access one of the arrays using `a['x']`, the whole contents will be read into memory as a standard `np.array` rather than an `np.memmap`, regardless of whether you specify `mmap_mode=`. – ali_m Feb 02 '15 at 15:47

1 Answer


An np.complex128 array with dimensions (200, 1440, 3, 13, 32) ought to take up about 5.35 GiB uncompressed, so if you really did have 8.3 GB of free, addressable memory then in principle you ought to be able to load the array.

However, based on your responses in the comments below, you are using 32 bit versions of Python and numpy. In Windows, a 32 bit process can only address up to 2GB of memory (or 4GB if the binary was compiled with the IMAGE_FILE_LARGE_ADDRESS_AWARE flag; most 32 bit Python distributions are not). Consequently, your Python process is limited to 2GB of address space regardless of how much physical memory you have.
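
If you are not sure which build you are running, a quick diagnostic from inside the interpreter (standard library only):

import struct
import sys

print(struct.calcsize('P') * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True on a 64 bit build, False on a 32 bit build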

You can either install 64 bit versions of Python, numpy, and any other Python libraries you need, or live with the 2GB limit and try to work around it. In the latter case you might get away with storing arrays that exceed the 2GB limit mainly on disk (e.g. using np.memmap), but I'd advise you to go for option #1, since operations on memmapped arrays are in most cases a lot slower than on normal np.arrays that reside wholly in RAM.


If you already have another machine with enough RAM to load the whole array into core memory then I would suggest you save the array in a different format (either as a plain np.memmap binary, or, perhaps better, in an HDF5 file using PyTables or h5py). It's also possible (although slightly trickier) to extract the problem array from the .npz file without loading it into RAM, so that you can then open it as an np.memmap array residing on disk:

import numpy as np

# some random sparse (compressible) data
x = np.random.RandomState(0).binomial(1, 0.25, (1000, 1000))

# save it as a compressed .npz file
np.savez_compressed('x_compressed.npz', x=x)

# now load it as a numpy.lib.npyio.NpzFile object
obj = np.load('x_compressed.npz')

# contains a list of the stored arrays in the format '<name>.npy'
namelist = obj.zip.namelist()

# extract 'x.npy' into the current directory
obj.zip.extract(namelist[0])

# now we can open the array as a memmap
x_memmap = np.load(namelist[0], mmap_mode='r+')

# check that x and x_memmap are identical
assert np.all(x == x_memmap[:])
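
As a rough sketch of the HDF5 route mentioned above (assuming h5py is available; the filenames and dataset name are just examples), you could convert the array once on a machine with enough RAM and then read only the slice you need on the memory-limited machine:

import h5py
import numpy as np

# on the machine with enough RAM: convert the npz contents once
a = np.load('myfile.npz')
with h5py.File('myfile.h5', 'w') as f:
    f.create_dataset('arr_0', data=a['arr_0'], compression='gzip')

# on the memory-limited machine: only the requested slice is read from disk
with h5py.File('myfile.h5', 'r') as f:
    subset = f['arr_0'][0, :, :, 9, :]  # shape (1440, 3, 32)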
ali_m
  • I suspect an array may require a contiguous range of virtual memory space, at least for some parts of it. – ivan_pozdeev Feb 03 '15 at 01:36
  • It can also be that he's using an x32 process that is limited to 4GB address space. – ivan_pozdeev Feb 03 '15 at 01:38
  • @atomh33ls can you confirm whether you are running 32bit Windows, or are otherwise limited to 4GB of address space? Are you able to allocate a new 4GB numpy array (e.g. `foo = np.ones(536870912, np.float64)`)? – ali_m Feb 03 '15 at 03:07
  • @ali_m I'm using 64bit Windows7 - The example you give also fails with the `ValueError: array is too big` – Lee Feb 03 '15 at 12:12
  • @atomh33ls Are you using 32 bit versions of Python/numpy? – ali_m Feb 03 '15 at 12:18
  • @atomh33ls Whelp, there's your problem! Either install the 64 bit versions of Python, numpy, and any other Python libraries you need, or live with the 4GB limit on your addressable memory. – ali_m Feb 03 '15 at 13:09
  • Thanks @ali_m, I had naively thought that the OS would just allow all available RAM to be used. Also, for people reading in the future, I found [this](http://stackoverflow.com/a/18282931/1461850) and [this](https://msdn.microsoft.com/en-us/library/aa366778.aspx#memory_limits) useful – Lee Feb 03 '15 at 14:04