
I currently have a list of about 700K tuples, each containing 4 ints and 1 string. The pickled list file is about 160 MB; it takes about 1 sec to load from the hard drive and another 1.6 sec to build a numpy object from it, so roughly 2.6 sec in total.

When this data is stored as a numpy.array object, the file is about 3.2 GB without a dtype declaration and takes 2 sec to load, whereas with dtype=object it is around 180 MB but takes 16 sec to load.

Is there a better way to make this much faster to load without using a huge amount of space?
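For reference, the baseline being timed is roughly the following (a minimal sketch; the file name and the exact way the numpy object is built are placeholders, not my actual code):

    import pickle
    import time

    import numpy as np

    t0 = time.perf_counter()
    with open("data.pkl", "rb") as f:      # placeholder file name
        records = pickle.load(f)           # list of ~700K (int, int, int, int, str) tuples
    t1 = time.perf_counter()

    arr = np.array(records, dtype=object)  # naive conversion falls back to dtype=object
    t2 = time.perf_counter()

    print(f"unpickle: {t1 - t0:.2f}s, to numpy: {t2 - t1:.2f}s")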

==========================================

Here are some test results from using h5py:

The string in each tuple has to be encoded, so the ints end up being saved as byte arrays as well, and the file comes out at about 800 MB. I'm not sure if I've done something wrong here, but it only took about half a sec to load.

If saved as a numpy.array with the string field's dtype declared as h5py.string_dtype('utf-8'), it ends up at 270 MB but takes around 2.4 sec to load.
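Roughly what I mean, as a sketch (file, dataset, and field names here are placeholders):

    import h5py
    import numpy as np

    # structured dtype with a variable-length UTF-8 string field
    dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4'),
                   ('e', h5py.string_dtype('utf-8'))])
    arr = np.array(records, dtype=dt)       # records is the list of tuples

    with h5py.File("data.h5", "w") as f:    # placeholder file name
        f.create_dataset("records", data=arr)

    with h5py.File("data.h5", "r") as f:
        loaded = f["records"][:]            # read the whole dataset into memory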

===================================

Tests with numpy.savez, with the string field declared as S40: 190 MB, loading time 0.6 sec.
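Again roughly, as a sketch (file and field names are placeholders):

    import numpy as np

    # fixed-width bytes for the string field: 'S40' holds up to 40 bytes
    # (plain str values are ASCII-encoded by numpy; encode non-ASCII strings yourself)
    dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4'), ('e', 'S40')])
    arr = np.array(records, dtype=dt)

    np.savez("data.npz", records=arr)       # np.savez_compressed trades CPU time for size

    npz = np.load("data.npz")
    loaded = npz["records"]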

====================================

user2625363
  • Save the ints in a separate array. – user2357112 Sep 21 '21 at 00:35
  • How long are the strings? – mkrieger1 Sep 21 '21 at 00:35
  • Also consider using a database (e.g. SQLite). – mkrieger1 Sep 21 '21 at 00:37
  • See https://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk . Pickle is not the best format for space or performance (though it is quite good for portability, which is not always required). The NPY binary format is far faster and more space-efficient, and it can be combined with a very fast compression algorithm like LZ4. Note that non-native types (i.e. pure Python types) should really be avoided, as they are slow to manipulate and heavy (not to mention GIL issues). – Jérôme Richard Sep 21 '21 at 00:41
  • What kind of array are you making, object dtype or structured? object dtype has little advantage over a list. – hpaulj Sep 21 '21 at 00:43
  • strings are between 0 and 30 chars – user2625363 Sep 21 '21 at 06:37
  • @Jérôme Richard, please have a look at the updates to this question – user2625363 Sep 21 '21 at 08:06

1 Answer


Since you say your strings are up to 30 characters, use a structured dtype like this:

dtype = [('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4'), ('e', 'S30')]

Then each record is 46 bytes long and performance should be much better than with dtype=object.
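For example (a minimal sketch; `records` stands in for your list of tuples):

    import numpy as np

    dtype = [('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4'), ('e', 'S30')]

    # strings longer than 30 bytes are truncated by 'S30'
    arr = np.array(records, dtype=dtype)

    np.save("records.npy", arr)    # ~46 bytes per row plus a small header
    arr2 = np.load("records.npy")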

Or, if you need to access a single column at a time, it is better to store each column in a separate array. You can save them all together as a single .npz file (which is just a Zip archive of .npy files) using np.savez(), with optional compression.
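A minimal sketch of that column-wise layout (the file and column names are just placeholders):

    import numpy as np

    a, b, c, d, e = zip(*records)
    np.savez("columns.npz",                # np.savez_compressed for a smaller file
             a=np.array(a, dtype='i4'),
             b=np.array(b, dtype='i4'),
             c=np.array(c, dtype='i4'),
             d=np.array(d, dtype='i4'),
             e=np.array(e, dtype='S30'))

    npz = np.load("columns.npz")
    col_a = npz['a']                       # only the columns you access are read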

John Zwinck
  • Hi. Sorry, the test strings were actually 1 to 40 characters; I have updated my question with some test results. It seems the file is still bigger than the pickled list, but it is certainly fast with numpy. One more question: how can I find the maximum string length across all tuples? Your solution requires declaring S with that maximum, otherwise anything longer gets truncated. – user2625363 Sep 21 '21 at 11:51
  • @user2625363 Right, you'll need to iterate once over all your data to determine the maximum string length if it is not known in advance. A simple `for` loop will do just fine. – John Zwinck Sep 24 '21 at 09:49
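For illustration, such a pass could look like this (a sketch assuming the string is the last element of each tuple):

    # longest string across all tuples (in characters; use the encoded byte
    # length instead if the strings can contain non-ASCII characters)
    max_len = max(len(t[4]) for t in records)
    dtype = [('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4'), ('e', f'S{max_len}')]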