
I have 30 million rows of data. Each contains an int array of size 512. Each int can have values from 0 to 50,500.

I'll need to retrieve about 100 rows at a time in a single query. I am wondering which data store will give the fastest retrieval for this.

It seems that the best data stores for this type of situation are HDF5 and NumPy memmaps.

I am wondering if there is some sort of analysis or prediction for what may be faster for my situation.
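To make the comparison concrete, here is a rough benchmark sketch I could run on a scaled-down synthetic copy of the data; the row count, file names, and chunk settings are placeholders rather than my real setup:

```python
import time
import numpy as np
import h5py

# Scaled-down stand-in for the real data (30M rows won't fit in RAM here).
N_ROWS, ROW_LEN = 100_000, 512
data = np.random.randint(0, 50_501, size=(N_ROWS, ROW_LEN), dtype=np.int32)

# Flat binary file, read back through a numpy memmap.
mm = np.memmap("rows.dat", dtype=np.int32, mode="w+", shape=(N_ROWS, ROW_LEN))
mm[:] = data
mm.flush()

# HDF5 dataset, chunked by row so each random row read touches one chunk.
with h5py.File("rows.h5", "w") as f:
    f.create_dataset("rows", data=data, chunks=(1, ROW_LEN))

# 100 random row indices; h5py fancy indexing wants them in increasing order.
idx = np.sort(np.random.choice(N_ROWS, size=100, replace=False))

mm = np.memmap("rows.dat", dtype=np.int32, mode="r", shape=(N_ROWS, ROW_LEN))
t0 = time.perf_counter()
rows_mm = mm[idx]                      # fancy indexing pulls just those rows into RAM
t_mm = time.perf_counter() - t0

with h5py.File("rows.h5", "r") as f:
    t0 = time.perf_counter()
    rows_h5 = f["rows"][idx, :]        # the same lookup against the HDF5 dataset
    t_h5 = time.perf_counter() - t0

print(f"memmap: {t_mm * 1e3:.2f} ms   hdf5: {t_h5 * 1e3:.2f} ms")
```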

I saw this question: Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

But the situation there is quite different from mine.

SantoshGupta7
  • Is the data here helpful? https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d – BallpointBen May 12 '20 at 01:42
  • Do you want to access a slice of rows or some random rows from the dataset? Is the data compressible (not completely random)? How do you write the file (column- or row-wise)? e.g. h5py: https://stackoverflow.com/a/48405220/4045774 But there are much faster compression algorithms available than lzf. – max9111 May 13 '20 at 14:27
  • I want to access random rows; the list of rows could be [32, 1, 48390202, 34543, 3244, etc.]. The data represents nodes on a graph, so it's not random, but I don't think it's compressible. For now we have written it row-wise, but we don't really care whether it's row- or column-wise; a sketch of the chunked row-wise layout I have in mind is below. – SantoshGupta7 May 14 '20 at 04:21
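Following max9111's comment, this is roughly the chunked, row-wise HDF5 layout I have in mind; the chunk shape, lzf compression, and file name are guesses to illustrate the idea, not settled choices:

```python
import numpy as np
import h5py

N_ROWS, ROW_LEN = 30_000_000, 512      # full-size layout; shrink N_ROWS to try it out

with h5py.File("graph_rows.h5", "w") as f:
    dset = f.create_dataset(
        "rows",
        shape=(N_ROWS, ROW_LEN),
        dtype=np.int32,
        chunks=(1, ROW_LEN),           # one chunk per row -> cheap random row reads
        compression="lzf",             # may not help much if the data is incompressible
    )
    # Write in large blocks rather than one row at a time.
    block = 100_000
    for start in range(0, N_ROWS, block):
        stop = min(start + block, N_ROWS)
        # Stand-in for the real rows; the real code would write the graph data here.
        dset[start:stop] = np.random.randint(
            0, 50_501, size=(stop - start, ROW_LEN), dtype=np.int32
        )

# Later, fetch ~100 arbitrary rows (indices must be sorted for h5py fancy indexing).
with h5py.File("graph_rows.h5", "r") as f:
    idx = np.sort(np.array([32, 1, 34543, 3244], dtype=np.int64))
    rows = f["rows"][idx, :]
```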

0 Answers