I have a large dataframe of ~17 million rows and 7 columns, which I want to transpose (pivot) based on two unique columns. Due to memory limitations, I cannot use the pandas.pivot_table function, so I wrote my own code that transposes the dataframe row by row. The code can be viewed at: https://bpaste.net/show/xRyQ
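
A minimal sketch of the idea (the real code is behind the link above; the column names `sample_id`, `test` and `result` are placeholders, and the since-removed `DataFrame.append` reflects the pandas of the time):

```python
import pandas as pd

# Sketch of the row-by-row approach; 'sample_id', 'test' and 'result'
# are placeholder column names, the actual code is in the linked paste.
def pivot_row_by_row(df):
    pivot_df = pd.DataFrame()
    for sample_id, group in df.groupby('sample_id'):
        # Build one wide row from this id's (test, result) pairs ...
        wide_row = dict(zip(group['test'], group['result']))
        wide_row['sample_id'] = sample_id
        # ... and append it; append() copies the whole frame each time
        # (DataFrame.append was removed in pandas 2.0).
        pivot_df = pivot_df.append(wide_row, ignore_index=True)
    return pivot_df
```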

Unfortunately, after a while my page-fault rate and handle count start increasing dramatically, and my non-paged memory drops to essentially zero. I am not sure whether this is caused by a memory leak, or simply by my new, pivoted dataframe growing in size and therefore consuming memory.

Therefore, my two core questions would be:

  • What is the exact cause of my observations? Is it a memory leak, or is it the growing dataframe size?
  • What changes/improvements can I make to my Python code to fix these memory problems and speed up my solution? Would e.g. partitioning the data with the Dask library be an option (see the sketch below)? I would rather not change my hardware specifications.
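
For reference, an untested sketch of what the Dask option might look like (`data.csv` and the column names are placeholders; Dask's pivot_table requires the pivoted column to be a known categorical):

```python
import dask.dataframe as dd

# Untested sketch of the Dask option mentioned above: dask.dataframe
# processes the data in partitions instead of loading it all at once.
# 'data.csv', 'sample_id', 'test' and 'result' are placeholders.
ddf = dd.read_csv('data.csv')

# Dask's pivot_table needs the pivoted column as a known categorical.
ddf['test'] = ddf['test'].astype('category').cat.as_known()

pivot = ddf.pivot_table(index='sample_id', columns='test',
                        values='result', aggfunc='mean')
result = pivot.compute()   # materializes the (much smaller) pivoted frame
```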

My hardware specifications are:

  • 16 GB RAM
  • 8 CPU cores, Intel i7-6700 (3.4 GHz)
  • Windows 7, 64-bit

Thank you in advance, and please let me know if you have any additional questions :)

wptmdoorn
  • As you know the size of your pivot_df in advance, it may be better to pre-allocate the whole pivot_df and then fill in the rows, instead of appending row by row, which copies the dataframe each time (see https://stackoverflow.com/a/24913075/3944322). – Stef Jul 10 '19 at 15:44
  • @Stef, thank you so much. If you like, you can submit this as an answer and I will accept it straight away. Instead of appending to a `pandas.DataFrame`, I created a list of dictionaries (each dict representing a row) and converted the list to a `pandas.DataFrame` at the end (sketched below). This sped it up dramatically. – wptmdoorn Jul 10 '19 at 19:00
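
A minimal sketch of the list-of-dicts approach from that comment, reusing the placeholder column names from the question:

```python
import pandas as pd

# Collect plain dicts in a list; appending to a list is cheap and does
# not copy anything, unlike DataFrame.append.
rows = []
for sample_id, group in df.groupby('sample_id'):
    wide_row = dict(zip(group['test'], group['result']))
    wide_row['sample_id'] = sample_id
    rows.append(wide_row)

# Convert once at the end: a single allocation instead of one per row;
# keys missing from a dict become NaN automatically.
pivot_df = pd.DataFrame(rows)
```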

1 Answer


As you know the size of your pivot_df in advance, it might be better to pre-allocate the whole pivot_df and then fill in the rows, instead of appending row by row, which copies the dataframe each time (see also this answer: https://stackoverflow.com/a/24913075/3944322).
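
A minimal sketch of the pre-allocation idea, assuming the placeholder column names from the question (`sample_id` becomes the new index, `test` the new columns, `result` the values):

```python
import numpy as np
import pandas as pd

# Allocate the full result frame once ...
pivot_df = pd.DataFrame(np.nan,
                        index=df['sample_id'].unique(),
                        columns=df['test'].unique())

# ... then fill it in place; .at is a fast scalar setter, and no copies
# of pivot_df are made along the way.
for row in df.itertuples(index=False):
    pivot_df.at[row.sample_id, row.test] = row.result
```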

Stef