I have the following problem:

I have a set of several HDF5 files containing similar data frames, which I want to sort globally based on multiple columns.

My input is the file names and an ordered list of columns I want to use for sorting. The output should be a single hdf5 file containing all the sorted data.

Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.

Naively, I would first copy all the data into a single HDF5 file (which is not difficult) and then find a way to sort this huge file without loading it entirely into memory.

Is there a quick way to sort a pandas data structure stored in an HDF5 file on multiple columns, given that it does not fit in memory as a whole?

I have already looked at ptrepack, but it seems to allow sorting on only a single column.
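
To make the memory constraint concrete, here is a minimal sketch of the general shape I have in mind: sort each file on its own (each one fits in memory), then lazily merge the sorted pieces on the same multi-column key. It uses only the Python standard library, and plain lists of dicts stand in for the HDF5 files; the names `sort_chunk`, `merge_sorted_chunks`, `file_a`, and `file_b` are hypothetical, and the HDF5 reading and writing is left out entirely.

```python
import heapq
from operator import itemgetter


def sort_chunk(rows, sort_columns):
    """Sort one chunk (a list of dict rows) that fits in memory."""
    return sorted(rows, key=itemgetter(*sort_columns))


def merge_sorted_chunks(sorted_chunks, sort_columns):
    """Lazily merge already-sorted chunks into one globally sorted stream."""
    return heapq.merge(*sorted_chunks, key=itemgetter(*sort_columns))


# Hypothetical stand-ins for the rows of two HDF5 files.
file_a = [{"a": 2, "b": 1}, {"a": 1, "b": 3}]
file_b = [{"a": 1, "b": 2}, {"a": 2, "b": 0}]
sort_columns = ["a", "b"]  # ordered list of columns to sort on

sorted_chunks = [sort_chunk(rows, sort_columns) for rows in (file_a, file_b)]
merged = list(merge_sorted_chunks(sorted_chunks, sort_columns))
```

In the real setting I assume each chunk would be one file loaded into a data frame, sorted on the column list, written back, and then read in blocks during the merge, so that only a small part of each file is in memory at any time. What I am missing is a quick way to do that merge step against HDF5 storage.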

Luca Fiaschi

0 Answers