How can I save a big `numpy` array as a '*.npz' file with limited filesystem capacity?

Question

I have a numpy array that, saved as an uncompressed '*.npz' file, takes about 26 GiB (its dtype is numpy.float32). numpy.savez() fails with:

OSError: Failed to write to /tmp/tmpl9v3xsmf-numpy.npy: 6998400000 requested and 3456146404 written

I hoped that saving it compressed would save the day, but numpy.savez_compressed() fails as well:

OSError: Failed to write to /tmp/tmp591cum2r-numpy.npy: 6998400000 requested and 3456157668 written

because numpy.savez_compressed() first writes the array, uncompressed, to a temporary file.
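A quick way to check this diagnosis is to compare the array's raw size with the free space in the temporary directory. The sketch below is only an illustration: `temp_space_report` is a hypothetical helper, not part of numpy. It also assumes that the 6998400000 in the error message is an element count rather than a byte count, which seems likely because it equals the length of the array.

```python
import shutil
import tempfile

import numpy as np

# Assumption: the error message counts float32 elements, not bytes.
n_items = 6998400000
raw_bytes = n_items * np.dtype(np.float32).itemsize
print(f"raw payload: {raw_bytes} bytes = {raw_bytes / 2**30:.2f} GiB")


def temp_space_report(arr, tmp_dir=None):
    """Compare an array's raw size with the free space in the temp directory.

    Hypothetical helper for illustration only.
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    free = shutil.disk_usage(tmp_dir).free
    needed = arr.nbytes  # raw data only; the .npy header adds a little more
    return {"tmp_dir": tmp_dir, "needed": needed, "free": free,
            "fits": needed < free}


# Small stand-in array; the real one would be the 26 GiB array.
report = temp_space_report(np.zeros(1000, dtype=np.float32))
print(report)
```

If `fits` is false for the real array, any approach that spools the uncompressed data to that directory can be expected to fail in the same way.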

I do not consider the obvious "use additional storage" an answer. ;)

[EDIT]

The tag low-memory refers to disk space, not RAM.

What if you try to save it as an hdf5 file? Check this https://stackoverflow.com/questions/20928136/input-and-output-numpy-arrays-to-h5py — Ignacio Vergara Kausel, Feb 28 '18 at 12:19
Can't you use a lighter format like `float16`, `int8`, `uint8`, etc.? — Mazdak, Feb 28 '18 at 12:24
@IgnacioVergaraKausel It is worth a try if I find no way of saving it as '*.npz'. A lot of code depends on the format. — abukaj, Feb 28 '18 at 12:24
@Kasramvd unfortunately not, I have already moved from `float16` as it lacked precision. — abukaj, Feb 28 '18 at 12:25
@abukaj then, most likely you'll have to chunk your array into smaller pieces. Although I find it strange that you can have it all in memory but not on disk. — Ignacio Vergara Kausel, Feb 28 '18 at 12:26
Maybe you could pass a [gzip](https://docs.python.org/3/library/gzip.html) object as file to `np.savez`. — swenzel, Feb 28 '18 at 12:30
@IgnacioVergaraKausel I agree. I have 62.9G of RAM and 13 G of `/` (which also makes me worry why I am unable to save 6.9 G there) — abukaj, Feb 28 '18 at 12:32
If you have such a big array and need that precision, that _is_ how much it takes to store it. The only way you could really reduce it (besides generic compression) is if there are known patterns in the data, e.g. is it a sparse array, or are there repeated or derived values? If all the values have about the same exponent maybe storing only the mantissa in `int16`/`uint16` could be enough? Also, do you know what is your file system? It may limit the size of the files that you can store. — jdehesa, Feb 28 '18 at 12:47
If you keep your machine running, maybe an in-memory filesystem could also be useful. There is, of course, quite some risk of losing your data if the machine goes down unexpectedly. — swenzel, Feb 28 '18 at 13:05
@swenzel, ...unfortunately, `.npz` files are zip files, and zip format was built with an expectation of ability to `seek()`, so I don't expect writing directly to a gzip object (which is effectively append-only for purpose of writes) to work. Saving to tmpfs or another in-RAM filesystem, and only then moving the content to fixed-disk, is more appropriate. — Charles Duffy, Feb 28 '18 at 13:28
@CharlesDuffy well, I suppose then you have to save it to a BytesIO object first, then compress that. Which, due to memory demand, is probably no solution either... — swenzel, Feb 28 '18 at 13:36
@swenzel Indeed, it is - I have already implemented it. It causes my system to swap, but it lets me save my data. — abukaj, Feb 28 '18 at 13:38
Which version of Python? I believe a more efficient solution is available with Python 3.6 or newer. — Charles Duffy, Feb 28 '18 at 16:15
@CharlesDuffy 3.6 I have already replaced `io.BytesIO()` with `ZipFile.open(..., mode='w')` — abukaj, Feb 28 '18 at 16:19
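The in-RAM filesystem idea from the comments could be sketched roughly as follows. This is an illustration under assumptions, not something the thread tested: `save_npz_via_staging` is a hypothetical helper, and `staging_dir` is assumed to be a RAM-backed mount (on many Linux systems `/dev/shm` is one) with enough room for the data. It temporarily redirects the `tempfile` default directory, since the error messages above show numpy spooling to `/tmp`.

```python
import os
import shutil
import tempfile

import numpy as np


def save_npz_via_staging(final_path, staging_dir, **arrays):
    """Build a compressed .npz inside `staging_dir`, then move it to `final_path`.

    Hypothetical helper. `staging_dir` is assumed to be RAM-backed (for
    example a tmpfs mount); nothing here verifies that.
    """
    previous_tempdir = tempfile.tempdir
    # numpy versions that spool each array to a temporary .npy file use
    # tempfile's default directory, so point it at the staging directory.
    tempfile.tempdir = staging_dir
    fd, staged_path = tempfile.mkstemp(suffix=".npz", dir=staging_dir)
    os.close(fd)
    try:
        np.savez_compressed(staged_path, **arrays)
        # Only the compressed archive ever reaches the destination filesystem.
        shutil.move(staged_path, final_path)
    finally:
        tempfile.tempdir = previous_tempdir
        if os.path.exists(staged_path):
            os.remove(staged_path)


# Small demonstration; both directories are ordinary temp dirs here.
staging = tempfile.mkdtemp()
target = os.path.join(tempfile.mkdtemp(), "data.npz")
data = np.arange(10, dtype=np.float32)
save_npz_via_staging(target, staging, A=data)
with np.load(target) as loaded:
    restored = loaded["A"]
```

As noted in the comments, anything held in a RAM-backed filesystem is lost if the machine goes down before the move completes.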

score 1 · Answer 1 · answered Feb 28 '18 at 13:36

Note: I would be more than happy to accept a more RAM-efficient solution.

I have browsed the numpy.savez_compressed() code and decided to reimplement part of its functionality:

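import io
import zipfile

import numpy as np

def saveCompressed(fh, **namedict):
    with zipfile.ZipFile(fh,
                         mode="w",
                         compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for k, v in namedict.items():
            # Serialize the whole array to an in-memory buffer first;
            # this is what makes the approach RAM-hungry.
            buf = io.BytesIO()
            # np.lib.format is the public home of write_array
            np.lib.format.write_array(buf,
                                      np.asanyarray(v),
                                      allow_pickle=False)
            # Compress the buffer straight into the archive
            zf.writestr(k + '.npy',
                        buf.getvalue())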

It causes my system to swap, but at least I am able to store my data (dummy data used in the example):

>>> A = np.ones(12 * 6 * 6 * 1 * 6 * 6 * 10000 * 5 * 9, dtype=np.float32)
>>> with open('test.npz', 'wb') as fh:
...     saveCompressed(fh, A=A)
...
>>> A = np.load('test.npz')['A']
>>> A.shape
(6998400000,)
>>> (A == 1).all()
True

Charles Duffy · Accepted Answer · 2018-02-28T18:07:34.717

1

With the addition of ZipFile.open(..., mode='w') in Python 3.6, you can do better:

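import zipfile

import numpy as np

def saveCompressed(fh, **namedict):
    with zipfile.ZipFile(fh, mode="w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for k, v in namedict.items():
            # Stream the array into the archive member instead of
            # buffering it; force_zip64 is needed because the member
            # size is not known in advance and may exceed 2 GiB.
            with zf.open(k + '.npy', 'w', force_zip64=True) as buf:
                # np.lib.format is the public home of write_array
                np.lib.format.write_array(buf,
                                          np.asanyarray(v),
                                          allow_pickle=False)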
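A small round trip can confirm that an archive written this way is still readable by `np.load`. The sketch below restates the streaming approach under a different, hypothetical name (`save_compressed_streaming`) so that it runs on its own, and it uses a tiny array and a temporary path rather than the real data.

```python
import os
import tempfile
import zipfile

import numpy as np


def save_compressed_streaming(path, **arrays):
    """Write arrays into a compressed .npz, one archive member at a time."""
    with zipfile.ZipFile(path, mode="w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for name, value in arrays.items():
            # The member is written incrementally, so no full copy of the
            # serialized array is held in memory or on disk.
            with zf.open(name + ".npy", mode="w", force_zip64=True) as member:
                np.lib.format.write_array(member, np.asanyarray(value),
                                          allow_pickle=False)


path = os.path.join(tempfile.mkdtemp(), "roundtrip.npz")
original = np.linspace(0.0, 1.0, 1000, dtype=np.float32).reshape(10, 100)
save_compressed_streaming(path, A=original)

with np.load(path) as npz:
    names = list(npz.files)
    restored = npz["A"]

print(names, restored.shape, restored.dtype)
```

Because only the compressed stream touches the disk, the space needed depends on how well the data compresses rather than on its raw size.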

edited Feb 28 '18 at 18:07

answered Feb 28 '18 at 16:18

Charles Duffy


Looks almost exactly like the implementation I am testing right now, except for `with zf.open(k + '.npy', mode='w', force_zip64=True) as buf:` – abukaj Feb 28 '18 at 16:24
Using `zf.open()` is the key difference, since it allows the file created inside the zip to be written incrementally (thus, with a sane ZipFile implementation, with bounded memory usage). – Charles Duffy Feb 28 '18 at 16:32
I mean the `, force_zip64=True` part. – abukaj Feb 28 '18 at 16:34
Your code ends with `ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.`. I guess it is about the size of the array. Would you mind including the `force_zip64=True` part in your answer? – abukaj Feb 28 '18 at 16:40
