I have a large dataframe of roughly 17 million rows and 7 columns that I want to transpose (pivot) based on two unique columns. Due to memory limitations, I cannot use the pandas.pivot_table
function, so I wrote my own code that pivots the dataframe row by row. The code can be viewed at: https://bpaste.net/show/xRyQ
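For reference, the approach is essentially the following (heavily simplified sketch, not the exact code from the link; `id`, `key` and `value` are placeholder names for my two unique columns and the value column):

```python
import pandas as pd

def pivot_row_by_row(df):
    """Build the pivoted frame one input row at a time instead of calling pivot_table."""
    result = {}
    for row in df.itertuples(index=False):
        # one output row per unique 'id'; each unique 'key' becomes a column
        result.setdefault(row.id, {})[row.key] = row.value
    return pd.DataFrame.from_dict(result, orient="index")
```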
Unfortunately, after a while my page-fault rate and handle count start increasing dramatically, and my non-paged memory drops to basically zero. I am currently not sure whether this is caused by a memory leak or simply by my new, pivoted dataframe growing in size and consuming memory.
My two core questions are:
- What is the exact cause of these observations? Is it a memory leak, or is it the growing size of the pivoted dataframe?
- What changes or improvements can I make to my Python code to fix the memory problems and/or speed up my solution? Would partitioning the data with the Dask library be an option (see the sketch below)? I would rather not change my hardware.
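Regarding the Dask idea, this is roughly what I had in mind (just a sketch, assuming the raw data can be read from a CSV; `id`, `key` and `value` are again placeholder column names):

```python
import dask.dataframe as dd

# Read the data in partitions rather than loading everything at once (assuming a CSV source).
ddf = dd.read_csv("data.csv", blocksize="64MB")

# dask's pivot_table requires the column that becomes the new headers
# to be categorical with known categories.
ddf["key"] = ddf["key"].astype("category").cat.as_known()

# 'mean' is used here because each (id, key) pair occurs only once in my data,
# so the mean is just the value itself.
pivoted = dd.pivot_table(ddf, index="id", columns="key", values="value", aggfunc="mean")

# Write the result partition by partition instead of materialising it all in RAM
# (needs pyarrow or fastparquet installed).
pivoted.to_parquet("pivoted/")
```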
My hardware specifications are:
- 16 GB RAM
- 8 CPU cores, Intel i7-6700 (3.4 GHz)
- Windows 7, 64-bit
Thank you in advance, and please let me know if you have any additional questions :)