
I have a huge set of large 3-dimensional boolean arrays that I need to store. They contain both False and True values, but for illustration consider the following array of comparable shape:

bool_array = np.zeros((20000, 20000, 5), dtype=bool)

When I use

np.save('bool_array.npy', bool_array)

and

bool_array = np.load('bool_array.npy')

the resulting file is over 2 GB and loading is slow (4-5 s). Note that bool_array is very sparse: any row in any of the five slices contains at most 100 True values.

What would be a more memory-efficient and faster alternative (file format, exploiting sparsity, etc.) for saving bool_array?

mbpaulus
  • Related / possible dup: [packing boolean array needs go throught int (numpy 1.8.2)](https://stackoverflow.com/questions/34511362/packing-boolean-array-needs-go-throught-int-numpy-1-8-2) – jpp Apr 12 '18 at 13:05
  • You might write a function that only stores the indexes of your `True` values. So instead of storing `[False, False, False, True, False, True]` you only store the `True` indexes which are `[3, 5]` in this example. – Patric Apr 12 '18 at 13:07
  • Sparse array (as @Tagas suggests) is one way. Another is to use `np.packbits` - `numpy` uses 1 byte to store each Boolean, but you can effectively use 1 byte to store 8 Booleans via `np.packbits`. So it depends on how sparse your data really is. – jpp Apr 12 '18 at 13:13
  • Cheers, just checked out packbits, pretty cool, now I am exploring sparsity-leveraging solutions. – mbpaulus Apr 12 '18 at 13:25
  • You could try np.savez_compressed also. For me the size went from 200MB to 3MB on switching to this from np.savez – Yesh Mar 15 '19 at 18:36
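
The three suggestions in the comments (bit packing, storing only the indices of the `True` values, and a compressed archive) can be sketched roughly as follows. This is a minimal illustration, not a benchmark: the small `shape`, the random test data, the temporary directory, and the file names are all assumptions made so the snippet runs quickly, and `np.unpackbits(..., count=...)` needs NumPy 1.17 or newer. For the full (20000, 20000, 5) array, bit packing would give about 250 MB (2e9 bits / 8). If "row" means a length-20000 row within one slice, there are at most 100 × 20000 × 5 = 10 million `True` values, so int64 flat indices would take at most about 80 MB before any compression.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the (20000, 20000, 5) array so the sketch runs quickly.
shape = (200, 200, 5)
rng = np.random.default_rng(0)
bool_array = np.zeros(shape, dtype=bool)
# Flip a few random cells to True to mimic sparsity (made-up test data).
bool_array.flat[rng.choice(bool_array.size, size=500, replace=False)] = True

tmpdir = tempfile.mkdtemp()  # hypothetical output location
packed_path = os.path.join(tmpdir, "packed.npy")
index_path = os.path.join(tmpdir, "indices.npy")
npz_path = os.path.join(tmpdir, "compressed.npz")

# 1) np.packbits: store 8 booleans per byte instead of 1.
packed = np.packbits(bool_array)  # flattens, then packs
np.save(packed_path, packed)
unpacked = np.unpackbits(np.load(packed_path), count=bool_array.size)
restored_packed = unpacked.reshape(shape).astype(bool)

# 2) Store only the flat indices of the True values.
true_idx = np.flatnonzero(bool_array)
np.save(index_path, true_idx)
restored_idx = np.zeros(shape, dtype=bool)
restored_idx.flat[np.load(index_path)] = True

# 3) np.savez_compressed: compressed .npz archive of the original array.
np.savez_compressed(npz_path, bool_array=bool_array)
with np.load(npz_path) as archive:
    restored_npz = archive["bool_array"]

# Report on-disk sizes for this toy example.
for path in (packed_path, index_path, npz_path):
    print(os.path.basename(path), os.path.getsize(path), "bytes")
```

Which variant is smallest or fastest to load depends on how sparse the real data is, so it is worth timing all three on a representative array. The shape has to be stored or known separately for variants 1 and 2.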

0 Answers