
When working with a large data set that takes up ~10GB of memory when loaded as a pandas DataFrame, I noticed that slicing it with a boolean mask seemed to use as much memory as making a copy, e.g.:

dfsub = df[df['date']<20201001]

which supposedly returns a view of about half of the original DataFrame, but actually caused memory usage to go up 50% to ~15GB (watching top in real time).

If I only take a few rows with:

dfsub = df[df['date']==20201001]

the memory footprint goes back down to 10GB.
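One way to quantify what a slice actually costs, rather than eyeballing top, is `DataFrame.memory_usage(deep=True)`; a minimal sketch, with a small synthetic frame standing in for the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real ~10GB frame
df = pd.DataFrame({'date': np.random.randint(20200101, 20210101, 1_000_000),
                   'x': np.random.rand(1_000_000)})

full = df.memory_usage(deep=True).sum()
sub = df[df['date'] < 20201001].memory_usage(deep=True).sum()

# The sliced frame reports its own, smaller buffers: it owns a copy
print(full, sub)
```

If the slice were a true view, it would not account for its own separate buffers.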

Shouldn't the slicing only return a view of the original DataFrame that doesn't actually copy the underlying data? I guess I just don't understand how DataFrame 'views' work. I thought a view was merely an index into the same underlying data store, incurring minimal overhead. Otherwise, what's the point of distinguishing 'views' from 'copies'?
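Whether a result shares memory with the original can be checked directly with `np.shares_memory`. A minimal sketch (assuming a recent pandas with plain numpy-backed data) showing that boolean indexing copies while a positional slice is a genuine view:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# Boolean indexing materializes a new array: a copy
bool_sub = s[s < 500_000]
print(np.shares_memory(bool_sub.to_numpy(), s.to_numpy()))  # False

# A positional slice points into the same underlying buffer: a view
slice_sub = s.iloc[:500_000]
print(np.shares_memory(slice_sub.to_numpy(), s.to_numpy()))  # True
```

So the 50% jump you see is consistent with boolean indexing always producing a copy of the selected rows.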

In addition, I noticed that even when selecting a tiny slice, the first call momentarily doubles the memory usage to ~20GB but then goes back down to ~10GB. Subsequent calls do not incur this transient memory spike.

(pandas 0.23.4 on python 3.4.3)

Kevin S.

1 Answer


Someone correct me if I'm wrong, but this looks like it's happening because you're saving the slice into a variable. I've had similar problems before where memory grew drastically and my program crashed because it ran out of memory. Try using the slice in place where it's needed; that worked for me and kept my program from running out of memory. For example:

dfsub = df[df['date']==20201001]
for i,row in dfsub.iterrows():
    print(row)

could be replaced with:

for i,row in df[df['date']==20201001].iterrows():
    print(row)

This way the sliced frame only lives for the duration of the expression instead of being kept alive by the `dfsub` reference, so its memory can be reclaimed as soon as the iteration finishes.
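If the slice does need a name, dropping the reference once you're done has the same effect; a small sketch (toy frame standing in for the real one, using `weakref` just to observe the cleanup):

```python
import gc
import weakref

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': np.arange(10), 'x': np.random.rand(10)})

dfsub = df[df['date'] < 5]   # boolean slice: a copy bound to a name
ref = weakref.ref(dfsub)

del dfsub                    # drop the only reference to the copy
gc.collect()                 # usually unnecessary under CPython, but forces cleanup
print(ref() is None)         # True: the copy has been released
```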

bellerb
  • My test does not use `dfsub` in any way (or I'd agree that the subsequent usage might cause a copy of the data). In your example, are you suggesting that merely assigning the slice/view to a variable `dfsub` causes data to be copied, or that the subsequent iteration does? – Kevin S. Sep 30 '21 at 19:54
  • I believe it's the fact that it's being assigned to a variable; that's what the problem was for me in the past. – bellerb Sep 30 '21 at 19:56
  • Python is reference-based à la Java, so I doubt that is the case. Actually my original call was in place, `df[df['date']==20201001].to_csv(...)`, which had the memory explosion. I removed `to_csv` to see the effect of slicing alone, in case `to_csv` did any behind-the-scenes copying. – Kevin S. Sep 30 '21 at 20:02
  • Oh okay, that makes sense given how you are using it. When you removed `to_csv`, did the memory still explode? – bellerb Sep 30 '21 at 20:04
  • Yes, it's especially strange on the first slicing call which momentarily doubles memory usage. Subsequent slicing calls only increase the memory by roughly the amount of data in the slice. – Kevin S. Sep 30 '21 at 20:19
  • I've started doing some reading on this now, and it seems you're not alone in being confused about what is a view vs. a copy. For some reason a copy is being made instead of a view when you slice. What's the data type of `dfsub`? From what I'm reading, that can affect whether you get a copy instead of a view. This seems to be a similar question to yours that might help: https://stackoverflow.com/questions/23296282/what-rules-does-pandas-use-to-generate-a-view-vs-a-copy – bellerb Sep 30 '21 at 20:23