NB: This is my first foray into memory profiling with Python, so perhaps I'm asking the wrong question here. Advice on improving the question is appreciated.

I'm working on some code where I need to store a few million small strings in a set. According to top, this uses roughly 3x the amount of memory reported by heapy. I'm not clear on what all the extra memory is used for, or how to figure out whether (and how) I can reduce the footprint.
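
For scale, a back-of-the-envelope estimate of what a few million short strings in a set should cost (the string format and count below are illustrative, not the real data):

import sys

# Illustrative only: rough cost of a few million short strings in a set
# on 64-bit Python 2. The string format and count are made up.
n = 3000000
strings = set('key:%d' % i for i in xrange(n))

per_str = sys.getsizeof('key:%d' % 12345)   # size of one short str object
table = sys.getsizeof(strings)              # the set's hash table alone

print 'one string:     %d bytes' % per_str
print 'strings total:  ~%d MB' % (per_str * n / (1024 * 1024))
print 'set hash table: ~%d MB' % (table / (1024 * 1024))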

memtest.py:

from guppy import hpy
import gc

hp = hpy()

# do setup here - open files & init the class that holds the data

print 'gc', gc.collect()
hp.setrelheap()
raw_input('relheap set - enter to continue') # top shows 14MB resident for python

# load data from files into the class

print 'gc', gc.collect()
h = hp.heap()
print h

raw_input('enter to quit') # top shows 743MB resident for python

The output is:

$ python memtest.py 
gc 5
relheap set - enter to continue
gc 2
Partition of a set of 3197065 objects. Total size = 263570944 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 3197061 100 263570168 100 263570168 100 str
     1       1   0       448   0 263570616 100 types.FrameType
     2       1   0       280   0 263570896 100 dict (no owner)
     3       1   0        24   0 263570920 100 float
     4       1   0        24   0 263570944 100 int

So in summary, heapy shows 264MB while top shows 743MB. What's using the extra 500MB?
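
For reference, a minimal Linux-only way (not part of the test script above) to read the resident set size from inside the process, so it can be compared with heapy's numbers at the same points:

def rss_mb():
    # Linux-only: current resident set size (what top reports as RES),
    # read from /proc. VmRSS is given in kB.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024
    return 0

print 'RSS: %d MB' % rss_mb()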

Update:
I'm running 64-bit Python on Ubuntu 12.04 in VirtualBox on Windows 7.
I installed guppy as per the answer here:
sudo pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy

  • First of all, what platform are you on? If it's 64-bit, are you using a 32-bit or 64-bit Python? Anyway, my guess is that heapy reports current heap usage in C malloc terms, and your interpreter has `malloc`'d then `free`'d 500MB that hasn't been reclaimed by the OS yet, but that's just a guess. – abarnert Sep 13 '12 at 00:09
  • I just tried to `pip install guppy` to try it out myself. On my 64-bit Mac, I get tons of warnings because it's assuming sizeof(unsigned long) is 4 or converting unsigned long to int. And then, as soon as you try to import it, you get a whole string of errors. All this despite the fact that it's version 0.1.9, and starting at 0.1.8 it supposedly 'Works with 64 bits and Python 2.6'. Maybe it doesn't work with Python 2.7? Maybe it just doesn't work? – abarnert Sep 13 '12 at 00:13
  • @abarnert updated with some answers for you. Is there any way to determine if the memory is just un'free'd - that'd be nice if this isn't really a problem. I should've noted I used an alternate install as apparently the standard install doesn't work on py 2.7. Whether this one does, I can't say - but it does give more reasonable results, whereas I don't see how I could be using the amount of memory top says given what I'm actually storing. – elhefe Sep 13 '12 at 00:31
  • @abarnert addendum - if heapy doesn't work, I guess the question becomes 'how do I figure out what's using so much memory and fix it?' Heapy was the best (only?) solution I've found for that so far. – elhefe Sep 13 '12 at 00:33
  • I tried installing the trunk version instead of the latest release (using your command line), and… now I get different 64-bit warnings, but they look just as serious. Anyway, is there more documentation of what heapy does than the 1-page tutorial, a PDF that won't render in Acrobat or Preview, and a 404 link at pkgcore? – abarnert Sep 13 '12 at 00:46
  • For non-guppy-based possibilities, have you tried searching, or looking at the "Related" links to the right? There are some good discussions of how you can('t) make sure memory is freed (or control whether freed memory is returned to the OS), and lower-level ways to get memory use statistics on linux, and so on. That may be more helpful than just figuring out why heapy and top return different numbers… – abarnert Sep 13 '12 at 00:50
  • @abarnert According to [this SO question](http://stackoverflow.com/questions/110259/python-memory-profiler) heapy documentation is not too good, and that was my experience as well. – elhefe Sep 13 '12 at 00:57
  • @abarnet I've spent most of today reading various SO memory questions :/ I just looked through a bunch of the questions related to freeing memory, and it seems as though I won't be able to determine if the 500MB is just unreclaimed memory or an actual problem with my code, which is a real drag. Not quite sure where to go from here. – elhefe Sep 13 '12 at 01:13
  • Well, where to go depends what you're trying to accomplish. (Is there an actual problem?) The best way to avoid keeping lots of memory alive is to avoid using it in the first place. Maybe you can load and construct the data using iterators or smallish-buffer reads instead of all at once. If the memory use is unavoidable, your best bet is usually to create a temporary child process that does the memory-intensive work, and have it feed results back to the parent iteratively (or via a mmapped file). There's really nothing Python-specific here, so you can look farther afield for information. – abarnert Sep 13 '12 at 19:40
  • @abarnet Your parenthetical remark hits the nail on the head - what I'd first like is to understand whether I really have a problem. If I trust heapy and assume the 500MB is simply unreclaimed, I don't. However, I don't understand how to move forward on determining whether that's a valid assumption. BTW thanks for all the comments so far, much appreciated. – elhefe Sep 14 '12 at 00:59
  • OK, but again, what is the problem you're worried about. "Top shows 743MB" doesn't mean you're in danger of running out of vmem space and crashing, or thrashing swap, or starving other processes out of physical memory, or anything else; unless your worry is "people who look at top will think I'm hogging memory" it scarcely matters. Do you have an actual problem you can identify, or are you trying to solve any problem that might exist? At any rate, if you do need to do anything, no matter what it is, the answers are probably the same—iterative processing, or child processes. – abarnert Sep 14 '12 at 19:20
  • @abarnet ...True. I suppose I saw the 500MB difference and the 'I have to fix that!' light went off in my head. So far I haven't seen any actual physical symptoms, as you say. I'm running some stress tests where I load the full data set now to see if that causes any issues. – elhefe Sep 14 '12 at 23:24
  • I use `getrusage` from `resource` an example I use is here in ruse.py: https://gist.github.com/raw/4285986/cb24f788cedaf190b56065dc4c9e8d03e63ee534/ruse.py – lukecampbell Jan 04 '13 at 21:20
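
A minimal sketch along the lines of the `getrusage` approach mentioned in the last comment (assuming Linux, where ru_maxrss is reported in kilobytes; on OS X it is in bytes):

import resource

def peak_rss_mb():
    # Peak resident set size of this process so far. On Linux
    # ru_maxrss is in kilobytes; on OS X it is in bytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print 'peak RSS: %d MB' % peak_rss_mb()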

0 Answers