
I have 30 million rows of data. Each contains an int array of size 512. Each int can have values from 0 to 50,500.

I'll need to retrieve about 100 rows at a time in a single query. I am wondering which data store will give the fastest retrieval for this.

It seems that the best data stores for this type of situation are HDF5 and NumPy memmaps.

I am wondering if there is some sort of analysis or prediction for what may be faster for my situation.
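To make the comparison concrete, here is a rough benchmark sketch I could run on a scaled-down synthetic copy of the data; the row count, file names, and chunk settings are placeholders rather than my real setup:

```python
import time
import numpy as np
import h5py

# Scaled-down stand-in for the real data (30M rows won't fit in RAM here).
N_ROWS, ROW_LEN = 100_000, 512
data = np.random.randint(0, 50_501, size=(N_ROWS, ROW_LEN), dtype=np.int32)

# Flat binary file, read back through a numpy memmap.
mm = np.memmap("rows.dat", dtype=np.int32, mode="w+", shape=(N_ROWS, ROW_LEN))
mm[:] = data
mm.flush()

# HDF5 dataset, chunked by row so each random row read touches one chunk.
with h5py.File("rows.h5", "w") as f:
    f.create_dataset("rows", data=data, chunks=(1, ROW_LEN))

# 100 random row indices; h5py fancy indexing wants them in increasing order.
idx = np.sort(np.random.choice(N_ROWS, size=100, replace=False))

mm = np.memmap("rows.dat", dtype=np.int32, mode="r", shape=(N_ROWS, ROW_LEN))
t0 = time.perf_counter()
rows_mm = mm[idx]                      # fancy indexing pulls just those rows into RAM
t_mm = time.perf_counter() - t0

with h5py.File("rows.h5", "r") as f:
    t0 = time.perf_counter()
    rows_h5 = f["rows"][idx, :]        # the same lookup against the HDF5 dataset
    t_h5 = time.perf_counter() - t0

print(f"memmap: {t_mm * 1e3:.2f} ms   hdf5: {t_h5 * 1e3:.2f} ms")
```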

I saw this question: Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)?

But the situation there is quite different from mine.

SantoshGupta7
  • Is the data here helpful? https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d – BallpointBen May 12 '20 at 01:42
  • Do you want to access a slice of rows or some random rows from the dataset? Is the data compressible (not completely random)? How do you write the file (column- or row-wise)? e.g. h5py: https://stackoverflow.com/a/48405220/4045774 But there are much faster compression algorithms available than lzf. – max9111 May 13 '20 at 14:27
  • I want to access random rows; the list of rows could be [32, 1, 48390202, 34543, 3244, etc.]. The data represents nodes on a graph, so it's not random, but I don't think it's compressible. For now we have written it row-wise, but we don't really care whether it's row- or column-wise; a sketch of the chunked row-wise layout I have in mind is below. – SantoshGupta7 May 14 '20 at 04:21
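Following max9111's comment, this is roughly the chunked, row-wise HDF5 layout I have in mind; the chunk shape, lzf compression, and file name are guesses to illustrate the idea, not settled choices:

```python
import numpy as np
import h5py

N_ROWS, ROW_LEN = 30_000_000, 512      # full-size layout; shrink N_ROWS to try it out

with h5py.File("graph_rows.h5", "w") as f:
    dset = f.create_dataset(
        "rows",
        shape=(N_ROWS, ROW_LEN),
        dtype=np.int32,
        chunks=(1, ROW_LEN),           # one chunk per row -> cheap random row reads
        compression="lzf",             # may not help much if the data is incompressible
    )
    # Write in large blocks rather than one row at a time.
    block = 100_000
    for start in range(0, N_ROWS, block):
        stop = min(start + block, N_ROWS)
        # Stand-in for the real rows; the real code would write the graph data here.
        dset[start:stop] = np.random.randint(
            0, 50_501, size=(stop - start, ROW_LEN), dtype=np.int32
        )

# Later, fetch ~100 arbitrary rows (indices must be sorted for h5py fancy indexing).
with h5py.File("graph_rows.h5", "r") as f:
    idx = np.sort(np.array([32, 1, 34543, 3244], dtype=np.int64))
    rows = f["rows"][idx, :]
```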

0 Answers