I currently have a list of about 700K tuples, each holding 4 ints and 1 string. The pickled list file is about 160 MB; loading it from the hard drive takes about 1 sec, and building a numpy object from it takes another 1.6 sec, so about 2.6 sec total.
When the same data is stored as a numpy.array, it takes 3.2 GB without a dtype declaration and about 2 sec to load, while with dtype=object it is around 180 MB but takes 16 sec to load.
Is there a better way to make this load much faster without taking huge space?
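For reference, here is a minimal sketch of the baseline described above; the file name `data.pkl` and the exact tuple layout are placeholders:

```python
import pickle
import time

import numpy as np

t0 = time.perf_counter()
with open('data.pkl', 'rb') as f:
    rows = pickle.load(f)        # list of ~700K (int, int, int, int, str) tuples
print(f'pickle load: {time.perf_counter() - t0:.2f}s')

t0 = time.perf_counter()
# With no dtype, numpy presumably promotes every column to one common
# unicode string dtype, which is what makes the array balloon to gigabytes.
arr_plain = np.array(rows)
# dtype=object keeps it small, but every element is a Python object,
# which makes saving and loading slow.
arr_obj = np.array(rows, dtype=object)
print(f'numpy convert: {time.perf_counter() - t0:.2f}s')
```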
==========================================
Here are some test results using h5py:
The string in each tuple has to be encoded, so all the ints also end up saved as byte arrays, and the file comes out at 800 MB (not sure if I did something wrong here), but it takes only about half a second to load.
If saved as a numpy.array with the string field's dtype declared as h5py.string_dtype('utf-8'), it ends up at 270 MB but takes around 2.4 sec to load.
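A sketch of that h5py variant, assuming a compound dtype with a variable-length string field (the field names here are placeholders, and `rows` is the list of tuples from above):

```python
import h5py
import numpy as np

str_dt = h5py.string_dtype(encoding='utf-8')     # variable-length UTF-8 strings
dt = np.dtype([('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'),
               ('s', str_dt)])

arr = np.array(rows, dtype=dt)

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('table', data=arr)

with h5py.File('data.h5', 'r') as f:
    back = f['table'][:]    # note: the string field reads back as bytes by default
```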
===================================
Tests with numpy.savez, with the string field declared as S40: 190 MB, loading time 0.6 sec.
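A sketch of the savez variant, again with placeholder field names; 'S40' assumes no string exceeds 40 bytes, and non-ASCII strings would need to be encoded to bytes first:

```python
import numpy as np

dt = np.dtype([('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4'),
               ('s', 'S40')])            # fixed-width byte strings, max 40 bytes
arr = np.array(rows, dtype=dt)

np.savez('data.npz', table=arr)

with np.load('data.npz') as npz:
    back = npz['table']     # plain fixed-width dtype, no per-element Python objects
```

Since the whole thing is one array, np.save / np.load on a plain .npy file should behave about the same.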
====================================