
After loading a dataframe from a pickle with ~15 million rows (which occupies ~250 MB), I perform some search operations on it and then delete some rows in place. During these operations the memory usage skyrockets to 5 GB, sometimes 7 GB, which causes annoying swapping (my laptop has only 8 GB of memory).

The problem is that this memory is not freed once the operations finish (i.e., after the last two lines of the code below have executed): the Python process still occupies up to 7 GB of memory.

Any idea why this happens? I'm using Pandas 0.20.3.

A minimal example is below. In reality the 'data' variable would hold ~15 million rows, which is too large to post here.

import datetime
import pandas as pd

data = {'Time':['2013-10-29 00:00:00', '2013-10-29 00:00:08', '2013-11-14 00:00:00'], 'Watts': [0, 48, 0]}
df = pd.DataFrame(data, columns=['Time', 'Watts'])
# Convert the Time strings to datetimes
df['Time'] = pd.to_datetime(df['Time'])
# Use the Time column as the index of the dataframe
df.index = df['Time']
# Drop the now-redundant Time column
df = df.drop('Time', axis=1)

# Get the difference in time between two consecutive data points
differences = df.index.to_series().diff()
# Keep only the differences > 60 mins
differences = differences[differences > datetime.timedelta(minutes=60)]
# Get the date strings of the days on which data gathering resumed
toRemove = [date.strftime('%Y-%m-%d') for date in differences.index.date]

# Remove all data points belonging to days where the difference was > 60 mins
for dataPoint in toRemove:
    # df[dataPoint] selects that day's rows via partial string indexing
    df.drop(df[dataPoint].index, inplace=True)
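
(For comparison, a minimal sketch of the same filtering done in one pass, not the code from the original question; it assumes the goal is simply to drop every row falling on one of the days in toRemove:)

# One-pass alternative to repeated in-place drops: normalize() floors each
# timestamp to midnight, so all rows can be matched against whole days with
# a single boolean mask instead of one df.drop(..., inplace=True) per day.
badDays = pd.to_datetime(toRemove)
df = df[~df.index.normalize().isin(badDays)]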
RiccB
1 Answer


You might want to try invoking the garbage collector explicitly with gc.collect(). See How can I explicitly free memory in Python? for more information.
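
For example (a minimal sketch, to be run after the drop loop from the question has finished):

import gc

# Force a collection pass; the large temporaries created during the
# search/drop operations become reclaimable, and the process's resident
# memory should shrink back toward the size of the dataframe itself.
gc.collect()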

Ryan Stout
  • It actually frees the memory. So the source of my problem is that the garbage collector is not fast enough, and I need to free the memory by invoking it manually? – RiccB Jan 24 '18 at 16:48
  • What frees what memory? (I don't know what "it" is in your comment.) If you were freeing the memory, you wouldn't see the 7 GB memory consumption. Just because you do something like df.drop doesn't mean the memory has been reclaimed yet – Ryan Stout Jan 24 '18 at 17:38
  • Sorry, I meant that gc.collect() frees my memory. After calling that command the memory consumption goes down to ~290 MB, which is fine considering that the 'data' variable alone occupies ~250 MB. – RiccB Jan 24 '18 at 20:51
  • So, is there anything more you need in order to accept the answer? – Ryan Stout Jan 24 '18 at 22:07
  • Done! Can you confirm my suspicion from the first comment on your answer? Also, is it normal that so much memory is used for operations on a 250 MB file? – RiccB Jan 24 '18 at 22:38
  • Ah, I see. I think I misunderstood your first question. Unfortunately, I don't know much about when Python's garbage collector is invoked. I have seen pandas be a pretty big memory hog in the past, but for the types of problems I've worked on I've generally used NumPy arrays directly, so I'm not terribly experienced with pandas and can't offer much about how to make it use less memory or how to get the GC to kick in automatically – Ryan Stout Jan 24 '18 at 23:11
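
As a side note (not from the thread itself): pandas can report a frame's in-memory footprint, which helps separate the frame's own size from the transient peaks seen during the drops:

# Print column dtypes plus total memory usage; memory_usage='deep'
# also counts the contents of object (e.g. string) columns.
df.info(memory_usage='deep')
# Or per column, in bytes:
print(df.memory_usage(deep=True))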