
I have a case where I would like to open a compressed numpy file using mmap mode, but can't seem to find any documentation about how it will work under the covers. For example, will it decompress the archive in memory and then mmap it? Will it decompress on the fly?

The documentation doesn't cover that configuration.

Refefer
  • Are you talking about a file created with `np.savez`? Or one created with `np.save` and then compressed? `npz` files are loaded with `np.lib.npyio.NpzFile`. Look at its code. – hpaulj Mar 16 '15 at 16:10
  • @hpaulj is correct, although [it is possible](http://stackoverflow.com/a/28281287/1461210) to extract a compressed array from an `.npz` archive to disk, then open the decompressed array in memmap mode. For on-the-fly compression and decompression you should really be looking at HDF5 ([PyTables](http://www.pytables.org) or [h5py](http://www.h5py.org)). – ali_m Mar 17 '15 at 01:54

1 Answer


The short answer, based on looking at the code, is that archiving and compression, whether done with `np.savez` or gzip, are not compatible with opening files in `mmap_mode`. The question isn't how it is done, but whether it can be done at all.

The relevant bits of the `np.load` function:

elif isinstance(file, gzip.GzipFile):
    fid = seek_gzip_factory(file)
...
    if magic.startswith(_ZIP_PREFIX):
        # zip-file (assume .npz)
        # Transfer file ownership to NpzFile
        tmp = own_fid 
        own_fid = False
        return NpzFile(fid, own_fid=tmp)
...
    if mmap_mode:
        return format.open_memmap(file, mode=mmap_mode)

Look at `np.lib.npyio.NpzFile`. An npz file is a ZIP archive of `.npy` files. Loading it returns a dictionary-like object that only reads an individual variable (array) when you access it (e.g. `obj[key]`). There's no provision in its code for opening those individual files in `mmap_mode`.
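The lazy, dictionary-like behaviour described above can be checked with a small experiment. This is a minimal sketch, assuming a NumPy version whose `np.load` takes the ZIP branch quoted above; the file name `demo.npz` and the keys `a` and `b` are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical file and key names, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
npz_path = os.path.join(tmp_dir, "demo.npz")
np.savez_compressed(npz_path, a=np.arange(10), b=np.ones((3, 3)))

# mmap_mode is accepted, but for a ZIP archive np.load returns an NpzFile
# before the memmap branch is ever reached.
with np.load(npz_path, mmap_mode="r") as archive:
    container_type = type(archive).__name__
    keys = sorted(archive.files)
    a = archive["a"]  # the member is read and decompressed on access

# The array is an ordinary in-memory ndarray, not a memory map.
is_memmap = isinstance(a, np.memmap)
print(container_type, keys, is_memmap)
```

If the array comes back as a plain `ndarray` rather than an `np.memmap`, the `mmap_mode` argument was ignored for the archive.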

So a file created with `np.savez` cannot be opened in mmap mode directly. Its ZIP archiving and compression is a separate mechanism from the gzip handling that appears earlier in `np.load`.
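A workaround mentioned in the comments on the question is to extract the array from the archive to disk and memory-map the decompressed copy. The sketch below assumes the usual npz layout, where each array is stored as a ZIP member named `<key>.npy`; the file name `demo.npz` and the key `big` are hypothetical.

```python
import os
import tempfile
import zipfile

import numpy as np

# Hypothetical archive with a single array stored under the key "big".
tmp_dir = tempfile.mkdtemp()
npz_path = os.path.join(tmp_dir, "demo.npz")
np.savez_compressed(npz_path, big=np.arange(1000, dtype=np.int64))

# Extract the member to disk; zipfile decompresses it while writing.
with zipfile.ZipFile(npz_path) as zf:
    npy_path = zf.extract("big.npy", path=tmp_dir)

# The extracted file is a plain .npy file, so mmap_mode applies to it.
mapped = np.load(npy_path, mmap_mode="r")
print(type(mapped).__name__, mapped.shape)
```

The cost is a full decompressed copy on disk, so this trades disk space for the ability to page the array in lazily.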

But what of a single array saved with `np.save` and then gzipped? Note that `format.open_memmap` is called with `file`, not `fid` (which might be a gzip file).

More details are in `open_memmap`, in `np.lib.npyio.format`. Its first test is that `file` must be a string, not an already-open file object such as `fid`. It ends up delegating the work to `np.memmap`, and I don't see any provision in that function for gzip.
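Since `open_memmap` wants a path to an uncompressed file, the practical route for a gzipped `.npy` file is to decompress it to disk first. This is a rough sketch of that idea rather than anything NumPy does for you; the file names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile

import numpy as np

# Hypothetical file names, used only for this illustration.
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "data.npy")
gz_path = npy_path + ".gz"
restored_path = os.path.join(tmp_dir, "restored.npy")

# Save one array with np.save, then gzip the resulting file.
np.save(npy_path, np.arange(12, dtype=np.float64).reshape(3, 4))
with open(npy_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

# Stream-decompress back to a plain .npy file on disk.
with gzip.open(gz_path, "rb") as src, open(restored_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

# Pass the path (a string), not a file object, so the memmap branch applies.
mapped = np.load(restored_path, mmap_mode="r")
print(type(mapped).__name__, mapped.shape)
```

`shutil.copyfileobj` copies in chunks, so the decompression step itself does not need the whole array in memory.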

hpaulj
  • I recently came to the same conclusion by looking at the code. I was wondering if it would be hard to add this functionality. Given that the `numpy` developers are brilliant, it would be surprising if they had not even tried. Do you have an opinion on that? – Mathieu Dubois Jun 24 '15 at 07:45
  • It was pointed out in another recent `npz` question that you can go the other direction: create the compressed archive in memory (using StringIO). In all these cases the `numpy` developers haven't done special `C` work; they use existing `python` modules (`mmap`, `zip`, etc). `np.save` via `np.lib.npyio` is doing the specialized array work, and even there it 'punts' to `pickle` when the going gets hard (e.g. saving dtype objects). – hpaulj Jun 24 '15 at 17:52
  • Not sure I understand your comment. It seems that opening the array decompresses it in memory (probably with on-the-fly decompression), so I guess one can create a `numpy.memmap` object from that (the `bytes` variable). – Mathieu Dubois Jun 26 '15 at 15:53
  • Dig into the `np.lib.npyio` module, with side trips to `zipfile` and `mmap`. – hpaulj Jun 26 '15 at 17:01
  • http://stackoverflow.com/a/25837662/901925 demonstrates creating `savez` and `load` using a `io.BytesIO`, i.e. making an in memory compressed file. – hpaulj Jun 26 '15 at 17:07
  • Look particularly at `numpy.lib.npyio.NpzFile` to see how a variable is read from a npz file. – hpaulj Jun 26 '15 at 17:13
  • Hum, I don't think one would want an in-memory compressed array. I was thinking about simply creating a standard memory-mapped array from the npz file. – Mathieu Dubois Jul 02 '15 at 07:34
  • The in-memory compressed file isn't a solution to your problem, but understanding it might give insight into how compression, memory mapping and `np.save` interact. But I doubt there is a solution short of writing your own `zip` archiving/compressing code (in C). – hpaulj Jul 02 '15 at 16:04
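For reference, the in-memory compressed archive mentioned in these comments can be sketched as follows. It is only an illustration of how `np.savez_compressed` and `np.load` accept a file-like object; the key `x` is made up, and as noted above this does not give a memory-mapped array.

```python
import io

import numpy as np

# Write a compressed npz archive into a bytes buffer instead of a file.
buffer = io.BytesIO()
np.savez_compressed(buffer, x=np.arange(5))

# Rewind and load it back; the member is decompressed when accessed.
buffer.seek(0)
with np.load(buffer) as archive:
    x = archive["x"]

print(type(x).__name__, x)
```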