I use arrays stored in NumPy's .npz format. I have a lot of these files, all sharing the same structure: a file named my_file_var1_var2_var3.npz contains the following items (all arrays are 32-bit floats):

  • a 2D array of shape (N, Ns), with N=11 and Ns=2000 here
  • a 2D array of shape (12, N)
  • a 2D array of shape (300, N)
  • a 2D array of shape (300, Ns)
  • a float
  • an integer

It's quite annoying to have in excess of 1000 files, each taking up some 4 MB. I was thinking it would be good to shift them into a container like HDF5/PyTables or similar. The different arrays are just arrays; there's no preferential ordering or anything (they are effectively matrices or stacks of vectors that will be operated on). All the arrays for a given filename are required together simultaneously.
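
For reference, a minimal sketch of the current per-file workflow; the array names A–D and the scalar values are hypothetical placeholders:

    import numpy as np

    N, Ns = 11, 2000
    A = np.zeros((N, Ns), dtype=np.float32)
    B = np.zeros((12, N), dtype=np.float32)
    C = np.zeros((300, N), dtype=np.float32)
    D = np.zeros((300, Ns), dtype=np.float32)

    # np.savez stores each array as an uncompressed .npy member of a zip file
    np.savez("my_file_var1_var2_var3.npz",
             A=A, B=B, C=C, D=D, my_float=0.5, my_int=7)

    # Everything for one (var1, var2, var3) combination is loaded together
    with np.load("my_file_var1_var2_var3.npz") as data:
        A, B, C, D = data["A"], data["B"], data["C"], data["D"]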

Are there any recommendations on which formats would be better for retrieving the arrays associated with var1, var2 and var3, and that are portable and storage-efficient?

Jose

1 Answer

Storing your dataset in HDF5 format with PyTables would definitely make sense here (see for instance this example).

Not only will it put all your data in the same container, but you will also get compression, efficient querying, and possibly faster read/write access with Blosc.
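
For instance, a minimal sketch of writing and reading one compressed array node; the file and node names here are hypothetical:

    import numpy as np
    import tables

    # Blosc compression; complevel=5 is a reasonable speed/size trade-off
    filters = tables.Filters(complevel=5, complib="blosc")
    a = np.zeros((11, 2000), dtype=np.float32)

    with tables.open_file("container.h5", mode="w") as f:
        # create_carray makes a chunked array node, compressed on write
        f.create_carray("/", "A", obj=a, filters=filters)

    with tables.open_file("container.h5", mode="r") as f:
        a_back = f.root.A[:]  # decompressed transparently on read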

Because your items have variable shapes, you can't put all the items of the same type in a common array. So you have several choices there:

  1. Save each array as a separate HDF5 node (see the sketch after this list).
  2. If N is variable but has some reasonable maximum value N_max (say 20 or 30), you can create a single array of size (number_of_items, ..., N_max) per item type and fill the elements you don't need with zeros by default. Surprisingly, this could be more efficient if you need to query all the items at the same time, and you won't see the size overhead if you use compression.
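
A minimal sketch of option 1, with one group per (var1, var2, var3) combination and the scalars stored as node attributes; the group and array names are hypothetical:

    import numpy as np
    import tables

    N, Ns = 11, 2000
    filters = tables.Filters(complevel=5, complib="blosc")

    with tables.open_file("container.h5", mode="w") as f:
        # One group per original .npz file, named after var1, var2, var3
        g = f.create_group("/", "run_var1_var2_var3")
        f.create_carray(g, "A", obj=np.zeros((N, Ns), dtype=np.float32), filters=filters)
        f.create_carray(g, "B", obj=np.zeros((12, N), dtype=np.float32), filters=filters)
        f.create_carray(g, "C", obj=np.zeros((300, N), dtype=np.float32), filters=filters)
        f.create_carray(g, "D", obj=np.zeros((300, Ns), dtype=np.float32), filters=filters)
        # The float and the integer fit naturally as attributes on the group
        g._v_attrs.my_float = 0.5
        g._v_attrs.my_int = 7

    with tables.open_file("container.h5", mode="r") as f:
        g = f.get_node("/run_var1_var2_var3")
        A = g.A[:]  # all arrays for one run can be read back together
        my_float = g._v_attrs.my_float
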
rth