
I am storing a large text file (about 10 GB, N rows and 4 columns) in an HDF5 file using the h5py package, primarily because I do not want to load all of it into RAM.

I would like to sort the rows in the file by the second column. Any suggestions on how to do that?

I have also heard that this can be done in chunks. Any help with that would be appreciated.

Thanks!

nuki
  • Does this help? https://stackoverflow.com/questions/21271727/sorting-in-pandas-for-large-datasets – bigbounty Jul 23 '20 at 02:15
  • Instead of `h5py`, use PyTables (aka `tables`). It has optimized sort and search algorithms. Both packages can create and operate on an HDF5 file. (Obviously, you will have to read your text data into the HDF5 file first. There are other SO posts that show how to do that.) – kcw78 Jul 23 '20 at 12:13
  • @kcw78: Thanks. I am able to store my data in an HDF5 file, but I do not understand how to sort it. Can you please share a MWE? – nuki Jul 28 '20 at 20:15
  • @bigbounty: This link gives commands, but where do I use them in my Python script? Consider me a beginner; I would appreciate it if you could provide a MWE. – nuki Jul 28 '20 at 20:18
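
Since a MWE was requested, here is a minimal sketch of one way to do a chunked sort with `h5py` and NumPy. It is not the only approach and makes several assumptions that are not stated in the question: the data is a 2-D numeric dataset named `data` (hypothetical name), a single column fits in RAM even though the whole table does not, and the sorted copy is written to a new dataset called `data_sorted` (also hypothetical). The function name `sort_by_column` and the `chunk_rows` parameter are illustrative only.

```python
import h5py
import numpy as np


def sort_by_column(path, src="data", dst="data_sorted", key_col=1, chunk_rows=100_000):
    """Write a copy of dataset `src`, sorted by column `key_col`, to dataset `dst`."""
    with h5py.File(path, "r+") as f:
        data = f[src]

        # Assumption: one column fits in memory even if the full table does not.
        keys = data[:, key_col]
        order = np.argsort(keys, kind="stable")  # row indices in sorted order
        del keys

        # Raises if `dst` already exists in the file.
        out = f.create_dataset(dst, shape=data.shape, dtype=data.dtype)

        for start in range(0, len(order), chunk_rows):
            idx = order[start:start + chunk_rows]
            # h5py fancy indexing needs increasing indices, so read the rows
            # in file order and then restore the sorted order in memory.
            asc = np.argsort(idx)
            rows = data[idx[asc], :]
            block = np.empty_like(rows)
            block[asc] = rows
            out[start:start + len(idx)] = block
```

Calling `sort_by_column("myfile.h5")` (file name hypothetical) would leave the original dataset untouched and add the sorted copy next to it. Only `chunk_rows` rows plus the key column and its index array are held in memory at a time, but the reads are scattered across the file, so this can be slow on large data. If even one column does not fit in RAM, an external merge sort or the PyTables route suggested above may be a better fit.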

0 Answers