I am getting a `MemoryError` in Python when trying to sort a pandas DataFrame and then save it to disk.

    import pandas as pd

    df = pd.read_hdf('big_df_file.h5')
    df.sort_values(by='opt', inplace=True, kind='quicksort')
    df.to_hdf('sorted.h5', key='df')  # to_hdf requires a key; 'df' is an assumed name

My computer has 16 GB of RAM and the data file is 8 GB. Shouldn't I be able to do this without getting a `MemoryError`?

P.S. I am using quicksort because, of the available sorting algorithms, it is the one that allocates the least memory.

Versions:
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.17.1
João Abrantes
  • *When* do you get the `MemoryError`? Are you sure it gives you the error during sorting and not when loading the data? – Bakuriu Feb 01 '16 at 20:17
  • @Bakuriu I placed a few `print` statements and I get the error while sorting. – João Abrantes Feb 01 '16 at 20:23
  • [`numpy.sort`](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.sort.html)'s documentation states that sorting along any axis but the last may create temporary copies of the data. However, using plain `numpy` I cannot see any big change in memory. In fact, even using `mergesort` I see no real change in memory usage. – Bakuriu Feb 01 '16 at 20:27

1 Answer

We need more information. At what stage is the MemoryError raised? What type of data is loaded? How much of it is really necessary to perform the sort?

However, I will try to address the issue.

If the error is raised during read_hdf, I would suggest limiting the columns you load from the file: for example, load only the index column (or infer it by row enumeration) and the value column, and sort those. After that you can (perhaps) write the sorted data to a file incrementally.
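As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes the sort column alone fits in memory; `sorted_positions`, `iter_sorted_batches` and `load_rows` are hypothetical names, and the toy `table` list stands in for the file on disk. With pandas, the key column could come from something like `pd.read_hdf(path, key, columns=['opt'])` (which assumes the file was written in table format), and `numpy.argsort` would be a more memory-friendly way to compute the order.

```python
def sorted_positions(keys):
    """Return the row positions in the order that sorts `keys` (stable)."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def iter_sorted_batches(keys, load_rows, batch_size):
    """Yield full rows in sorted order, `batch_size` rows at a time.

    `load_rows` is a hypothetical callable: given a list of row positions,
    it returns those rows from the file, in that order.
    """
    order = sorted_positions(keys)
    for start in range(0, len(order), batch_size):
        yield load_rows(order[start:start + batch_size])


# Toy stand-in for the on-disk table: (opt, payload) rows.
table = [(3, 'c'), (1, 'a'), (2, 'b'), (0, 'z')]
keys = [row[0] for row in table]  # only the sort column is "loaded"
load_rows = lambda positions: [table[p] for p in positions]

result = [row
          for batch in iter_sorted_batches(keys, load_rows, 2)
          for row in batch]
```

Each yielded batch can be appended to the output file before the next one is fetched, so only the key column and one batch of full rows are held in memory at a time.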

An even more "hardcore" approach would be divide and conquer, as in an external merge sort: load only half of the file (or a quarter, whatever works best; according to the docs this is possible by passing the start and stop arguments), sort each piece on its own, and then merge the sorted pieces.
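A minimal sketch of such an external merge sort, using only the standard library, might look like the following. The function names are hypothetical, and rows are modelled as plain tuples rather than DataFrame rows. In the pandas setting each chunk would presumably come from a call like `pd.read_hdf(path, key, start=i, stop=i + n)`, which assumes table format. Note that the `key` argument of `heapq.merge` needs Python 3.5 or later, so this would not run as-is on Python 2.7.

```python
import heapq
import pickle
import tempfile


def write_run(sorted_rows):
    """Write one sorted chunk to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for row in sorted_rows:
        pickle.dump(row, f)
    f.seek(0)
    return f


def read_run(f):
    """Lazily read rows back from a run file, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def external_sort(chunks, key):
    """Sort each chunk in memory, spill it to disk, then k-way merge the runs."""
    # Only one chunk is held in memory at a time while the runs are written.
    runs = [write_run(sorted(chunk, key=key)) for chunk in chunks]
    try:
        # heapq.merge keeps just one pending row per run in memory.
        for row in heapq.merge(*[read_run(f) for f in runs], key=key):
            yield row
    finally:
        for f in runs:
            f.close()


# Toy data: three unsorted "chunks" of (opt, payload) rows.
chunks = iter([[(5, 'e'), (1, 'a')], [(4, 'd'), (2, 'b')], [(3, 'c')]])
merged = list(external_sort(chunks, key=lambda row: row[0]))
```

In real use the merged rows would be appended to the output file in batches instead of being collected into a list.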

More information on the general problem was provided by this answer

Fanchi