Can I store a dictionary using np.savez? The results are surprising (to me at least) and I cannot find a way to get my data back by key.

In [1]: a = {'0': {'A': array([1,2,3]), 'B': array([4,5,6])}}
In [2]: a
Out[2]: {'0': {'A': array([1, 2, 3]), 'B': array([4, 5, 6])}}

In [3]: np.savez('model.npz', **a)
In [4]: a = np.load('model.npz')
In [5]: a
Out[5]: <numpy.lib.npyio.NpzFile at 0x7fc9f8acaad0>

In [6]: a['0']
Out[6]: array({'B': array([4, 5, 6]), 'A': array([1, 2, 3])}, dtype=object)

In [7]: a['0']['B']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-c916b98771c9> in <module>()
----> 1 a['0']['B']

ValueError: field named B not found

In [8]: dict(a['0'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-d06b11e8a048> in <module>()
----> 1 dict(a['0'])

TypeError: iteration over a 0-d array

I do not understand exactly what is going on. It seems that my data becomes a dictionary inside a 0-dimensional array, leaving me with no way to get my data back by key. Or am I missing something?

So my questions are:

  1. What happens here? If I can still access my data by key, how?
  2. What is the best way to store data of this type? (a dict with str as key and other dicts as value)

Thanks!

Louic

2 Answers

It is possible to recover the data:

In [41]: a = {'0': {'A': array([1,2,3]), 'B': array([4,5,6])}}

In [42]: np.savez('/tmp/model.npz', **a)

In [43]: a = np.load('/tmp/model.npz')

Notice that the dtype is 'object'.

In [44]: a['0']
Out[44]: array({'A': array([1, 2, 3]), 'B': array([4, 5, 6])}, dtype=object)

And there is only one item in the array. That item is a Python dict!

In [45]: a['0'].size
Out[45]: 1

You can retrieve the value using the item() method. NB: this is not the items() method of dictionaries, nor anything intrinsic to the NpzFile class; it is the numpy.ndarray.item() method, which copies the value in the array to a standard Python scalar. In an array of object dtype, any value held in a cell of the array (even a dictionary) is a Python scalar:

In [46]: a['0'].item()
Out[46]: {'A': array([1, 2, 3]), 'B': array([4, 5, 6])}

In [47]: a['0'].item()['A']
Out[47]: array([1, 2, 3])

In [48]: a['0'].item()['B']
Out[48]: array([4, 5, 6])

To restore a as a dict of dicts:

In [84]: a = np.load('/tmp/model.npz')

In [85]: a = {key:a[key].item() for key in a}

In [86]: a['0']['A']
Out[86]: array([1, 2, 3])
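
On recent NumPy releases, np.load refuses to unpickle object arrays unless allow_pickle=True is passed, so the session above needs that flag to run today. A minimal sketch of the full round trip follows; the temp-directory path and the names original, path and restored are illustrative assumptions, not part of the original session.

```python
import os
import tempfile

import numpy as np

original = {'0': {'A': np.array([1, 2, 3]), 'B': np.array([4, 5, 6])}}

# Hypothetical scratch location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'model.npz')
np.savez(path, **original)

# Each inner dict was wrapped in a 0-d object array, which is pickled on disk.
# allow_pickle=True is needed to read it back; only use it on trusted files.
with np.load(path, allow_pickle=True) as npz:
    restored = {key: npz[key].item() for key in npz.files}

print(restored['0']['A'])
```

Using np.load as a context manager closes the underlying file once the dict of dicts has been rebuilt.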
unutbu
  • Thanks unutbu, that is clear. It seems overly complex though to actually store and recover my data in this way in a script, so I am still interested to see how you/others would store data of this type. A pickle maybe? – Louic Mar 26 '14 at 14:22
  • I think the first thing to consider is whether you really need a dict of dicts. Is that really the best data structure to use? One disadvantage is that it breaks up your NumPy arrays. Having many small NumPy arrays forces a Python loop whenever you want to perform an operation on each one. You get better performance out of NumPy when you can perform operations on one big array, because this pushes more of the work into fast underlying C/Fortran functions and less onto relatively slow Python loops. – unutbu Mar 26 '14 at 18:52
  • You might want to investigate [Pandas](http://pandas.pydata.org) DataFrames instead. You could use a DataFrame with a multi-index to replace the two levels of dict keys, and you can store the DataFrame in a high-performance, compressed format like HDF5. – unutbu Mar 26 '14 at 18:54
  • Another option might be to store the data in a database like [sqlite](http://stackoverflow.com/q/18621513/190597), postgresql or mysql. Pandas can also save/load data to/from databases. – unutbu Mar 26 '14 at 19:06
  • It does not have to be a dict, but 'A', 'B', etc. have different sizes and therefore cannot be a single array (afaik). There are different models, each identified by a key ('0' in the example above). I will look into Pandas: that seems suitable. Thanks again – Louic Mar 26 '14 at 19:40
  • As Pandas is built on top of NumPy, it too will perform best if all your data is placed in one big DataFrame. Since your arrays are of different sizes, you could load the 1D arrays into columns, using NaNs to signify missing or non-existent data. If the 1D arrays are of roughly the same size, then this will waste only a little bit of memory and may give you better performance, more convenient syntax, and the ability to store the entire dataset as one DataFrame. – unutbu Mar 26 '14 at 19:53
  • *Why* does the nested dictionary become placed inside of an array? – DilithiumMatrix Dec 24 '14 at 20:43
  • @zhermes: If you trace through the [source code](https://github.com/numpy/numpy/blob/v1.9.1/numpy/lib/npyio.py#L459) you'll find that [`_savez` calls `np.asanyarray(val)`](https://github.com/numpy/numpy/blob/v1.9.1/numpy/lib/npyio.py#L597). So, when val is a dict, `_savez` converts it to an array. E.g., `np.asanyarray({'a':'b'})` is `array({'a': 'b'}, dtype=object)`. – unutbu Dec 24 '14 at 20:57
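
The wrapping described in the last comment can be reproduced without touching the disk. The sketch below calls np.asanyarray directly on a dict (the names inner and wrapped are illustrative) and shows that the result is a 0-d object array holding the very same dict, which is why indexing it by key fails while item() and [()] succeed.

```python
import numpy as np

inner = {'A': np.array([1, 2, 3]), 'B': np.array([4, 5, 6])}

# This is the conversion the comment attributes to _savez.
wrapped = np.asanyarray(inner)

# A 0-d array of object dtype: empty shape, exactly one element.
print(wrapped.dtype, wrapped.shape, wrapped.size)

# Two equivalent ways to pull the single Python object back out.
via_item = wrapped.item()
via_index = wrapped[()]

print(via_item is inner, via_index is inner)
```

Because the array only stores a reference to the dict, both accessors hand back the original object rather than a converted copy.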
Based on this answer: recover dict from 0-d numpy array

After

import numpy as np

a = {'key': 'val'}
np.savez('file.npz', a=a)  # note the use of a keyword for ease later

you can use

get = np.load('file.npz', allow_pickle=True)  # recent NumPy needs allow_pickle=True for object arrays
a = get['a'][()]  # this is crazy maybe, but true
print(a['key'])

It would also work without the use of a keyword argument, but I thought this was worth sharing too.
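
For the variant without a keyword argument, np.savez stores positional arrays under auto-generated names (arr_0, arr_1, and so on), so the lookup key changes accordingly. A small sketch, assuming a scratch file in a temp directory (the path and variable names are illustrative):

```python
import os
import tempfile

import numpy as np

a = {'key': 'val'}

# Hypothetical scratch location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'file.npz')
np.savez(path, a)  # positional argument: saved under the name 'arr_0'

# allow_pickle=True is needed because the dict is stored as an object array.
with np.load(path, allow_pickle=True) as get:
    names = list(get.files)
    recovered = get['arr_0'][()]

print(names, recovered['key'])
```

Passing a keyword, as in the answer, avoids having to remember the positional index later.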

KeithWM