When working with a large data set that takes up ~10GB of memory when loaded as a pandas DataFrame, I noticed that slicing it with a boolean mask seemed to use as much memory as making a copy, e.g.:
dfsub = df[df['date'] < 20201001]
which supposedly returns a view covering about half of the original DataFrame, yet actually drove memory usage up ~50% to ~15GB (observed in real time with top).
If I only take a few rows with:
dfsub = df[df['date'] == 20201001]
the memory footprint goes back down to ~10GB.
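
For reference, a rough sketch of how to reproduce and measure this, with scaled-down synthetic data standing in for my real set (psutil reads the same RSS figure that top shows; exact numbers will vary):

import os
import numpy as np
import pandas as pd
import psutil

def rss_gb():
    # Resident set size of this process in GB -- the same number top reports
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

# Scaled-down synthetic stand-in: 'date' stored as int YYYYMMDD plus payload columns
n = 10**7
df = pd.DataFrame({
    'date': np.random.randint(20200101, 20210101, size=n),
    'x': np.random.randn(n),
    'y': np.random.randn(n),
})
print("after load:       %.2f GB" % rss_gb())

dfsub = df[df['date'] < 20201001]   # mask selects roughly half the rows
print("after half slice: %.2f GB" % rss_gb())

dfsub = df[df['date'] == 20201001]  # mask selects only a handful of rows
print("after tiny slice: %.2f GB" % rss_gb())
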
Shouldn't the slicing return a view of the original DataFrame without copying the underlying data? I guess I just don't understand how DataFrame 'views' work. I thought a view was merely an index into the same underlying data store, which should incur minimal overhead. Otherwise, what is the point of distinguishing 'views' from 'copies'?
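
For what it's worth, numpy's np.shares_memory can test my mental model directly by checking whether a result overlaps the original's storage. A minimal sketch (my assumption being that .values exposes the underlying arrays):

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [20200901, 20201001, 20201101],
                   'x': [1.0, 2.0, 3.0]})

# Boolean indexing: the result does not overlap the original -> a copy
sub = df[df['date'] < 20201001]
print(np.shares_memory(sub['x'].values, df['x'].values))  # False

# Plain column access: the Series overlaps the original -> a view
col = df['x']
print(np.shares_memory(col.values, df['x'].values))       # True
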
In addition, I noticed that even when selecting a tiny slice, the first call momentarily doubles the memory usage to ~20GB but then goes back down to ~10GB. Subsequent calls do not incur this transient memory spike.
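
Since top only refreshes every couple of seconds, a background thread that samples RSS is one way to catch that short-lived peak. A sketch under the same assumptions (psutil, synthetic data as above):

import os
import threading
import time
import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())
peak = {'rss': 0}
stop = threading.Event()

def sample():
    # Poll RSS every few milliseconds to catch spikes that top would miss
    while not stop.is_set():
        peak['rss'] = max(peak['rss'], proc.memory_info().rss)
        time.sleep(0.005)

n = 10**7
df = pd.DataFrame({'date': np.random.randint(20200101, 20210101, size=n),
                   'x': np.random.randn(n)})

t = threading.Thread(target=sample)
t.start()
dfsub = df[df['date'] == 20201001]  # first tiny slice after loading
stop.set()
t.join()
print("peak RSS during first slice: %.2f GB" % (peak['rss'] / 1e9))
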
(pandas 0.23.4 on Python 3.4.3)