I have a large dataframe of ~17 million rows and 7 columns, which I want to transpose (pivot) based on two unique columns. Due to memory limitations, I cannot use the pandas.pivot_table function, so I wrote my own code that transposes the dataframe row by row. The code can be viewed at: https://bpaste.net/show/xRyQ
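
A minimal sketch of the idea (the real code is behind the link above; the column names `sample_id`, `test` and `result` are placeholders, and the since-removed `DataFrame.append` reflects the pandas of the time):

```python
import pandas as pd

# Sketch of the row-by-row approach; 'sample_id', 'test' and 'result'
# are placeholder column names, the actual code is in the linked paste.
def pivot_row_by_row(df):
    pivot_df = pd.DataFrame()
    for sample_id, group in df.groupby('sample_id'):
        # Build one wide row from this id's (test, result) pairs ...
        wide_row = dict(zip(group['test'], group['result']))
        wide_row['sample_id'] = sample_id
        # ... and append it; append() copies the whole frame each time
        # (DataFrame.append was removed in pandas 2.0).
        pivot_df = pivot_df.append(wide_row, ignore_index=True)
    return pivot_df
```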

Unfortunately, after a while my page-fault rate and handle count start increasing dramatically, and my non-paged memory drops to essentially zero. I am not sure whether this is caused by a memory leak, or simply by my new, pivoted dataframe growing in size and therefore consuming memory.

Therefore, my two core questions would be:

  • What is the exact cause of my observations? Is it a memory leak, or is it the growing dataframe size?
  • What changes/improvements can I make to my Python code to fix these memory problems and speed up my solution? Would e.g. partitioning the data with the Dask library be an option (see the sketch below)? I would rather not change my hardware specifications.
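
For reference, an untested sketch of what the Dask option might look like (`data.csv` and the column names are placeholders; Dask's pivot_table requires the pivoted column to be a known categorical):

```python
import dask.dataframe as dd

# Untested sketch of the Dask option mentioned above: dask.dataframe
# processes the data in partitions instead of loading it all at once.
# 'data.csv', 'sample_id', 'test' and 'result' are placeholders.
ddf = dd.read_csv('data.csv')

# Dask's pivot_table needs the pivoted column as a known categorical.
ddf['test'] = ddf['test'].astype('category').cat.as_known()

pivot = ddf.pivot_table(index='sample_id', columns='test',
                        values='result', aggfunc='mean')
result = pivot.compute()   # materializes the (much smaller) pivoted frame
```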

My hardware specifications are:

  • 16 GB RAM
  • 8 CPU cores, Intel i7-6700 (3.4 GHz)
  • Windows 7, 64-bit

Thank you in advance, and please let me know if you have any additional questions :)

wptmdoorn
  • As you know the size of your pivot_df in advance, it may be better to pre-allocate the whole pivot_df and then fill in the rows, instead of appending row by row, which copies the dataframe each time (see https://stackoverflow.com/a/24913075/3944322). – Stef Jul 10 '19 at 15:44
  • @Stef, thank you so much. If you like, you can submit this as an answer and I will accept it straight away. Instead of appending to a `pandas.DataFrame`, I created a list of dictionaries (each dict representing a row) and converted the list to a `pandas.DataFrame` at the end (sketched below). This sped it up dramatically. – wptmdoorn Jul 10 '19 at 19:00
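
A minimal sketch of the list-of-dicts approach from that comment, reusing the placeholder column names from the question:

```python
import pandas as pd

# Collect plain dicts in a list; appending to a list is cheap and does
# not copy anything, unlike DataFrame.append.
rows = []
for sample_id, group in df.groupby('sample_id'):
    wide_row = dict(zip(group['test'], group['result']))
    wide_row['sample_id'] = sample_id
    rows.append(wide_row)

# Convert once at the end: a single allocation instead of one per row;
# keys missing from a dict become NaN automatically.
pivot_df = pd.DataFrame(rows)
```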

1 Answer


As you know the size of your pivot_df in advance, it might be better to pre-allocate the whole pivot_df and then fill in the rows, instead of appending row by row, which copies the dataframe each time (see also this answer: https://stackoverflow.com/a/24913075/3944322).
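
A minimal sketch of the pre-allocation idea, assuming the placeholder column names from the question (`sample_id` becomes the new index, `test` the new columns, `result` the values):

```python
import numpy as np
import pandas as pd

# Allocate the full result frame once ...
pivot_df = pd.DataFrame(np.nan,
                        index=df['sample_id'].unique(),
                        columns=df['test'].unique())

# ... then fill it in place; .at is a fast scalar setter, and no copies
# of pivot_df are made along the way.
for row in df.itertuples(index=False):
    pivot_df.at[row.sample_id, row.test] = row.result
```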

Stef