I'm creating a dataset of windowed data for deep learning. I generated the data as numpy arrays: 4 arrays with shape (141038, 360) and 1 label array with shape (141038,). I saved the arrays in an npz file, but the file size is very large: 1.5 GB. I'm new to Python and programming, so I have no idea how big the file should be. However, when I converted the arrays to pandas DataFrames, the memory usage was in the same range. The problem is that I have 6 files totaling 9 GB, and probably another dataset with overlapping windows that is 7 times larger, so it could come to 63 GB.

  • Is such a file size realistic, or have I done something wrong? (It's just a file with some numbers, not a game.)

  • Is there another format to save my arrays that uses less space? (I tried HDF5 but got the same file size.)

  • I tried changing the datatypes, which reduced the size slightly (3 arrays as f8, 1 as int8, 1 as uint8). Are there other datatypes that could reduce the size more? For 0/1 values, is there a datatype more efficient than uint8?

  • For the float arrays, would reducing the precision help, or is there another way to shrink them?

  • I have some files filled with zero padding, some with edge padding, and others with interpolation, yet all the files have almost the same size. Shouldn't the files with zero padding be smaller?

Hishi51
  • Your arrays contain about 50 million values each; of course that is going to take some space. And BTW, you went far beyond the one question allowed per question. – Klaus D. May 04 '20 at 08:37

1 Answer

  1. Yes, if you're storing float64 data, it definitely is: 4 arrays × 141038 × 360 values × 8 bytes per value comes to roughly 1.6 GB, so 1.5 GB is in the expected range.

  2. You can try numpy.savez_compressed to save the arrays as a compressed archive; see the sketch after the reference link below.

ref: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez_compressed.html
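
A minimal sketch of savez_compressed (the array names here are placeholders for the question's arrays; note that structured signal data usually deflates much better than random numbers):

import numpy as np

# hypothetical stand-ins for one windowed array and the label array
windows = np.random.rand(141038, 360)
labels = np.random.randint(0, 2, size=141038, dtype=np.uint8)

# each array is deflate-compressed inside the .npz archive
np.savez_compressed("data.npz", windows=windows, labels=labels)

# load back using the keyword names given at save time
with np.load("data.npz") as data:
    windows2, labels2 = data["windows"], data["labels"]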

You can also use gzip directly, but the choice of compression algorithm matters for how much you save.

import gzip
import numpy as np

x = np.zeros((141038, 360))  # placeholder for one of the windowed arrays

# write the array through a gzip stream; the .npy header is preserved
with gzip.GzipFile("x.npy.gz", "w") as f:
    np.save(file=f, arr=x)
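
To read it back, open the gzip stream for reading and hand it to numpy.load (a sketch matching the save above):

import gzip
import numpy as np

# np.load accepts any file-like object, including a gzip stream
with gzip.GzipFile("x.npy.gz", "r") as f:
    x = np.load(f)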

This may be useful: Compress numpy arrays efficiently

  3. For binary data, uint8 wastes a lot of space: you can store 8 values (0/1) in one uint8. Just treat the 0/1 values as bits and encode 8 of them in a single uint8 with simple binary operations (see the np.packbits sketch after the size comparison below).

You can also use a bool array for 0/1 values, but note that it still takes one byte per element, the same as uint8. Beware the dtype codes in the comparison below: 'b' is int8 (one byte per element), while 'u8' is uint64 (eight bytes per element), which is where the 8× size difference comes from.

import numpy as np
import sys

# 'b' is the dtype code for int8: one byte per element
b = np.array([0, 1, 0] * 50000, dtype='b')
print(sys.getsizeof(b))   # 150096

# 'u8' is the dtype code for uint64 (not uint8): eight bytes per element
u8 = np.array([0, 1, 0] * 50000, dtype='u8')
print(sys.getsizeof(u8))  # 1200096
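
A sketch of the bit-packing mentioned above, using np.packbits and np.unpackbits (assuming the values are strictly 0/1):

import numpy as np

labels = np.array([0, 1, 0] * 50000, dtype=np.uint8)

# packbits stores eight 0/1 values per byte: 150000 values -> 18750 bytes
packed = np.packbits(labels)
print(packed.nbytes)  # 18750

# unpackbits restores the bits; slice back to the original length
restored = np.unpackbits(packed)[:labels.size]
assert np.array_equal(restored, labels)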
  4. Yes, definitely: going from float64 to float32 halves the size. And if you consider lossy compression an option, you can compress the arrays by a good factor.
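
For scale, casting one of the (141038, 360) arrays from float64 to float32 halves it (a sketch, assuming float32 precision is acceptable for your data):

import numpy as np

x = np.random.rand(141038, 360)  # float64 by default
print(x.nbytes)                  # 406189440 bytes, ~406 MB

x32 = x.astype(np.float32)       # half the bytes per value
print(x32.nbytes)                # 203094720 bytes, ~203 MB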

  5. It doesn't matter; only the shape and data types matter. NumPy arrays are not compressed, so every element takes the same space regardless of its value. Comparing them with images ("a black image is smaller because of its uniformity, so zero-padded arrays should take less space") is misleading: that intuition holds for lossily compressed formats like JPEG, not for raw arrays. Once you do compress (e.g. with savez_compressed), the zero-padded files should indeed come out smaller.
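
To see the distinction concretely (a sketch; the exact compressed size depends on the deflate level):

import numpy as np
import os

zeros = np.zeros((141038, 360))

np.savez("raw.npz", zeros)               # uncompressed: size fixed by shape and dtype
np.savez_compressed("small.npz", zeros)  # all-zero content deflates dramatically

print(os.path.getsize("raw.npz"))        # ~406 MB no matter what the values are
print(os.path.getsize("small.npz"))      # a tiny fraction of that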

Zabir Al Nazi