
I'm writing a Python 3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally and then writing them to files is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:

properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
              'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

The reason I like this nested dictionary format is that it is intuitive to access -- the outer key is the object name, and the inner keys are the property names. The element corresponding to each inner key is a numpy array giving the value of some property as a function of time. My actual code generates a dict of ~100,000 objects (outer keys), each having ~100 properties (inner keys) recorded at ~1000 times (numpy float arrays).

I have noticed that when I do np.savez('filename.npz', **properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output files are a few GB in size (probably because np.savez falls back to pickle under the hood, since the values of my nested dict are dicts rather than arrays).
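
For reference, here is a minimal sketch of the round trip with the example dict above; since each outer value is a dict rather than an array, savez stores it as a pickled 0-d object array, and loading it back needs allow_pickle=True:

np.savez('filename.npz', **properties)

# allow_pickle is required on load; .item() unwraps the 0-d object array back into the inner dict
loaded = np.load('filename.npz', allow_pickle=True)
obj1 = loaded['obj1'].item()   # {'time': ..., 'x': ..., 'vx': ...}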

Is there a more efficient data structure that would be widely applicable to my use case? Is it worth switching from my nested dict to pandas dataframes, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary output format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each of the property time series arrays.

quantumflash
  • Are you concerned mainly with save/load "efficiency", or are there calculations that you are doing with the arrays that would work better if they were combined into one higher dimensional array? Are the innermost arrays all the same shape? – hpaulj Jan 07 '23 at 20:42
  • With that `savez`, it makes an `npy` file for each of the outer keys. And yes, each file will be a pickle of the dict with inner keys. I don't think that hurts the file space too much, though I haven't examined the memory use of a `dict` pickle. The `pickle` of an array is basically its `save` `npy` file. – hpaulj Jan 07 '23 at 20:46
  • I'm concerned with save efficiency (write speed and filesize) and just generally to know whether my nested dict approach is smart or not. The inner arrays are all the same size, yes. So I could just create a huge 3d numpy array where each column is a different property, each row gives the property values at a different time, and these 2d arrays are stacked for different objects along the 3rd dimension (a sketch of this stacking follows these comments). This should be a more efficient data structure, but I would like to retain a "header" of the name of each column and the ID of each object along the 3rd axis. Some columns may be strings or have nan/inf. – quantumflash Jan 07 '23 at 21:41
  • arrays don't have 'headers'. Dataframes do. But whether a frame with array elements is any more 'efficient' is unknown. – hpaulj Jan 07 '23 at 22:07
  • Knowing the sizes you can estimate the total memory use, at least for the array data - and along with it the total file(s). E.g. `100000*100*1000*8/1e9` is 80 GB of data. I assume the nested `dict` part of the store will be measured in MB, the size of the `keys`. In-memory `dict`s use some sort of hash table, which takes up some space, but I don't know how that is encoded, if at all, in the `pickle`. Unpickling might have to recreate the hash table from the equivalent lists (`list(d.items())`). – hpaulj Jan 07 '23 at 22:21
  • Hmm interesting. I think one thing I want to look into is how to convert my nested dict into a pandas dataframe with multi-indexing (so that, as I said, I can still access each object's 2D array using its name, and the dataframe also lets me keep names associated with each of my columns rather than just numerical indices). And then finally just use pandas' own .to_hdf function to save the dataframe into an hdf5 file. (And I can split my big 80 GB file into individual files with groups of objects to limit the individual filesize.) This way I also get hdf5 compressibility. – quantumflash Jan 07 '23 at 22:58
  • the comments on the accepted answer here are relevant (about nested dict vs pandas multi-index dataframe): https://stackoverflow.com/questions/22661764/storing-a-dict-with-np-savez-gives-unexpected-result – quantumflash Jan 07 '23 at 22:59
  • Using dicts is not a good idea for large datasets: they consume a lot of memory (due to the repeated keys) and dict objects are inefficient. Pandas dataframes are much more compact and often more efficient, except for string or other object-based columns. Here, it looks like the content is Numpy arrays of different sizes, so Pandas will store them as objects. Thus, this will still be a bit more compact than a dict (no need to repeat the key for each row) but not efficient due to the objects stored in each column. Jagged arrays are also inefficient in Numpy. – Jérôme Richard Jan 07 '23 at 23:34
  • While there are more compact data structures, and more efficient ones, they may also be significantly less user-friendly. Besides, some information is missing: what do you plan to do with the data structure? Is it only read, or do you plan to change it, and if so, how? Are all the arrays of type float? Is it possible for you to reduce the precision from 64-bit floats to 32-bit ones? What kind of operations are you planning to do on it? – Jérôme Richard Jan 07 '23 at 23:39
  • @JérômeRichard, I'm not sure that the dict is that inefficient, especially in a case like this. If the subdict keys are all the same, and especially if they are short strings like the example, they won't take up much memory. Each subdict will just have references to the same small set of strings. In my answer, I found that a pickle.dumps for one of these subdicts is about the same size as its `list(dd.values())`, and smaller than either a dataframe or even a recarray. – hpaulj Jan 08 '23 at 00:16
  • @hpaulj The thing is, using pickle in the first place is certainly not efficient, though there is not much better for dicts. I do not think it writes Numpy arrays efficiently (this is what your answer seems to indicate). It looks like dataframes are also stored inefficiently: the data appear to be stored as objects, as for lists. For dataframes, HDF5 is probably better, though it may not be optimal for small arrays like in this case. For dataframes, there is also Parquet, which should be better. – Jérôme Richard Jan 08 '23 at 01:53
  • Perhaps the inner dicts could be a [`namedtuple`](https://docs.python.org/3/library/collections.html#collections.namedtuple) instead. Once you've gathered the data, you create a single numpy array. Then there are more efficient ways to store, like parquet. – tdelaney Jan 08 '23 at 02:32
  • 1
  • @JérômeRichard, though the `np.save` of this array is 612 bytes; the array is small enough that the `npy` header adds 50%. I had assumed, or read, that apart from headers, the pickle of a numpy array is the same as np.save. – hpaulj Jan 08 '23 at 02:33
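
A minimal sketch of the 3D-array idea from the stacking comment above, assuming every object has the same properties sampled at the same number of times and that all properties are numeric (the file and variable names are just for illustration):

import numpy as np

# "headers": object names along the 3rd axis, property names for the columns
object_names = list(properties.keys())
property_names = list(properties[object_names[0]].keys())

# shape (n_objects, n_times, n_properties); requires equal-length, numeric arrays
cube = np.stack([np.column_stack([properties[o][p] for p in property_names])
                 for o in object_names])

# save the data together with both name lists so the name mapping survives a round trip
np.savez('properties_cube.npz', data=cube,
         objects=np.array(object_names), columns=np.array(property_names))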

1 Answer


Let's look at your obj2 value, a dict:

In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}

In [308]: dd
Out[308]: 
{'time': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 'x': array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
        -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
        -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519]),
 'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
         1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
        -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012])}

It's easy to make a dataframe from that:

In [309]: df = pd.DataFrame(dd)

In [310]: df
Out[310]: 
    time         x        vx
0      0 -0.481979 -1.602281
1      1  0.155978 -1.491630
2      2  0.441134 -1.170610
3      3  1.380628 -0.092675
4      4 -1.212734 -0.941331
5      5 -1.271200  1.863910
6      6  1.530727  1.006901
7      7  1.979926 -0.161684
8      8  0.136479  1.518014
9      9 -1.370568 -1.164364
10    10 -2.064708 -0.202543
11    11  0.923150 -1.602801
12    12  0.308854 -1.917494
13    13  0.648600  0.253666
14    14  1.302735 -1.619930

We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype (a sketch of that direct route follows the recarray below). But since I already have the frame, I'll go this route. The distinction between a structured array and a recarray is minor.

In [312]: arr = df.to_records()

In [313]: arr
Out[313]: 
rec.array([( 0,  0, -0.48197915, -1.60228105),
           ( 1,  1,  0.15597792, -1.49163002),
           ( 2,  2,  0.44113401, -1.17061046),
           ( 3,  3,  1.38062753, -0.09267467),
           ( 4,  4, -1.21273378, -0.94133092),
           ( 5,  5, -1.27120008,  1.86391024),
           ( 6,  6,  1.53072667,  1.006901  ),
           ( 7,  7,  1.9799255 , -0.16168439),
           ( 8,  8,  0.13647925,  1.5180135 ),
           ( 9,  9, -1.37056793, -1.16436363),
           (10, 10, -2.06470784, -0.20254291),
           (11, 11,  0.92314969, -1.60280149),
           (12, 12,  0.30885371, -1.91749387),
           (13, 13,  0.64860014,  0.25366602),
           (14, 14,  1.30273519, -1.61993012)],
          dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
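
For reference, a minimal sketch of that direct route, deriving the compound dtype from the dict's own arrays and filling the structured array field by field:

# build the compound dtype from the dict itself, then copy each array into its field
dt = np.dtype([(name, val.dtype) for name, val in dd.items()])
arr2 = np.zeros(len(dd['time']), dtype=dt)
for name in dt.names:
    arr2[name] = dd[name]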

Now let's compare the pickle strings:

In [314]: import pickle

In [315]: len(pickle.dumps(dd))
Out[315]: 561

In [316]: len(pickle.dumps(df))      # df.to_pickle makes a 1079 byte file
Out[316]: 1052

In [317]: len(pickle.dumps(arr))     # arr.nbytes is 420
Out[317]: 738                        # np.save writes a 612 byte file

And another encoding - a list:

In [318]: alist = list(dd.items())
In [319]: alist
Out[319]: 
[('time', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 ('x',
  array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
         -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
         -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519])),
 ('vx',
  array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
          1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
         -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012]))]
In [320]: len(pickle.dumps(alist))
Out[320]: 567
hpaulj
  • Thanks! What if I wanted to convert my nested dict `properties` into a "nested pandas dataframe" (or I guess a series of dataframes?). Would that be an efficient data structure, especially if all objects had the same size arrays so you could just stack along a 3rd dimension? In other words: outer dataframe/series key = object #/string, and the value for any object is a 2D dataframe like your `df` above, further assuming all objects have the same length columns for their own df (a sketch of this follows these comments). I would probably save as hdf5 (w/ or w/o compression depending on save/load time). – quantumflash Jan 16 '23 at 23:49
  • You may be wanting pandas multiindexing. Pandas cells (Series) can be object dtype and contain lists, arrays or strings, but they aren't efficient (nothing like multidimensional arrays), and can be a pain to save/load (`csv` file format is inherently 2d). But that's well beyond my pandas expertise. – hpaulj Jan 17 '23 at 00:29
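
A minimal sketch of the multi-indexed approach discussed in these comments, assuming all objects share the same property names (the level names and HDF5 key are just illustrative, and `to_hdf` needs the optional PyTables dependency):

import pandas as pd

# one frame per object, stacked with the object name as the outer index level
big = pd.concat({name: pd.DataFrame(d) for name, d in properties.items()},
                names=['object', 'row'])

big.loc['obj2']        # one object's 2D table, columns still named 'time', 'x', 'vx'

# binary, optionally compressed storage that preserves the index and column names
big.to_hdf('properties.h5', key='properties', complevel=5, complib='blosc')
back = pd.read_hdf('properties.h5', 'properties')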