I have a Python script that scrapes a list of URLs. For each URL, I fetch the HTML and run some logic on it.
I use Python 2.7.6 on Linux Mint 17 Cinnamon 64-bit.
The problem is that my main scraping object, which I instantiate for every URL, is never released from memory even though no references to it remain. As a result, memory usage grows rapidly, since the object is sometimes very big (up to 50 MB).
The simplified code looks something like this:
import resource


def scrape_url(url):
    """
    Simple helper method for scraping url
    :param url: url for scraping
    :return: some result
    """
    scraper = Scraper(url)  # instantiate the main Scraper object
    result = scraper.scrape()  # scrape it
    return result


## SCRIPT STARTS HERE
urls = get_urls()  # fetch some list of urls
for url in urls:
    print 'MEMORY USAGE BEFORE SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = scrape_url(url)  # call helper method for scraping
    print 'MEMORY USAGE AFTER SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print '-' * 50
My output is something like this:
MEMORY USAGE BEFORE SCRAPE: 75732 (kb)
MEMORY USAGE AFTER SCRAPE: 137392 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 137392 (kb)
MEMORY USAGE AFTER SCRAPE: 206748 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 206748 (kb)
MEMORY USAGE AFTER SCRAPE: 284348 (kb)
--------------------------------------------------
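A note on the measurement itself: ru_maxrss is the peak resident set size of the process, so it can only stay flat or grow and will not show memory being given back. A minimal sketch of reading the current resident set size next to the peak, assuming Linux (it parses /proc/self/status; the helper names peak_rss_kb and current_rss_kb are made up for illustration):

```python
import resource


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def current_rss_kb():
    """Current resident set size in kB, read from /proc (Linux only).

    Returns None if the value cannot be read on this platform.
    """
    try:
        with open('/proc/self/status') as status:
            for line in status:
                if line.startswith('VmRSS:'):
                    # The line looks like "VmRSS:     12345 kB"
                    return int(line.split()[1])
    except IOError:
        pass
    return None
```

Printing current_rss_kb() before and after each scrape, instead of the peak, would show whether usage ever drops between iterations.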
The Scraper object is big, and it is not released from memory. I tried:
scraper = None
del scraper
and even forcing a garbage collection with:
gc.collect()
but nothing helped.
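One way to check whether the object is really destroyed, independently of the memory numbers, is to keep only a weak reference to it and see whether that reference goes dead. A minimal sketch, assuming CPython; BigObject and _cache are hypothetical stand-ins for the Scraper and for some leftover container:

```python
import gc
import weakref


class BigObject(object):
    """Hypothetical stand-in for the Scraper object."""

    def __init__(self):
        self.payload = bytearray(1024 * 1024)  # about 1 MB of data


_cache = []  # hypothetical container that accidentally keeps objects alive


def is_freed_after_del():
    """Create an object, drop the only reference, report whether it died."""
    obj = BigObject()
    probe = weakref.ref(obj)  # a weak reference does not keep obj alive
    del obj
    gc.collect()
    return probe() is None


def is_freed_when_cached():
    """Same check, but a second reference is left behind in _cache."""
    obj = BigObject()
    _cache.append(obj)
    probe = weakref.ref(obj)
    del obj
    gc.collect()
    return probe() is None
```

If the probe goes dead but the process size stays the same, the object itself was collected and the question becomes why the freed memory is not returned to the operating system.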
When I print the number of references to the scraper object with:
print sys.getrefcount(scraper)
I get 2, which I think means that there are no other references to the object and that it should be cleaned up by the gc.
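For context on that number: in CPython the value reported by sys.getrefcount usually includes the temporary reference created by passing the object as an argument, so it is easier to reason about changes in the count than about the absolute value. A small sketch of that idea; Payload and refcount_deltas are hypothetical names used only for illustration:

```python
import sys


class Payload(object):
    """Hypothetical stand-in for the Scraper object."""


def refcount_deltas():
    """Return the reference count before, during and after extra references."""
    obj = Payload()
    base = sys.getrefcount(obj)         # baseline for a single name
    alias = obj                         # one more name bound to the object
    with_alias = sys.getrefcount(obj)
    holder = [obj]                      # one more reference from a list
    with_holder = sys.getrefcount(obj)
    del alias, holder                   # drop both extra references
    after = sys.getrefcount(obj)
    return base, with_alias, with_holder, after
```

Each extra reference should raise the count by one, and dropping them should bring it back to the baseline.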
The Scraper object has lots of sub-objects. Is it possible that a reference to one of its sub-objects is left somewhere, and for that reason the gc cannot release the main Scraper object? Or is there some other reason why Python doesn't release the memory?
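To look for leftover references to a sub-object, gc.get_referrers can list the containers that still point at it. Note that a stray reference to a sub-object would only keep the main object alive if that sub-object refers back to its parent. A rough sketch; Parent, Child and registry are hypothetical, and the exact set of referrers reported varies between Python versions:

```python
import gc


class Child(object):
    """Hypothetical sub-object of the scraper."""


class Parent(object):
    """Hypothetical stand-in for the Scraper object."""

    def __init__(self):
        self.child = Child()


registry = []  # hypothetical global that accidentally keeps sub-objects


def describe_referrers(obj):
    """Return the sorted type names of the objects that refer to obj."""
    return sorted(type(ref).__name__ for ref in gc.get_referrers(obj))


if __name__ == '__main__':
    parent = Parent()
    registry.append(parent.child)  # simulate a leftover reference
    print(describe_referrers(parent.child))
```

A 'list' entry in the result here corresponds to registry; in real code, an unexpected container type in that output is a hint about where a reference is being held.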
I found some topics about this on SO, and some of the responses say that memory cannot be released unless you spawn and kill child processes, which sounds really strange (LINK).
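For reference, the child-process approach mentioned in those responses can be sketched roughly as follows: each scrape runs in a short-lived process, and whatever it allocated goes back to the operating system when that process exits. This is only an illustration under assumptions: fake_scrape is a hypothetical stand-in for scrape_url, the result must be picklable, and there is no error handling (if the child fails, recv raises EOFError).

```python
import multiprocessing


def fake_scrape(url):
    """Hypothetical stand-in for scrape_url(): allocate a lot, return little."""
    page = bytearray(10 * 1024 * 1024)  # pretend this is the big Scraper state
    return len(url) + len(page)


def _worker(conn, func, url):
    """Run func(url) in the child and send the result back through the pipe."""
    conn.send(func(url))
    conn.close()


def scrape_in_child(func, url):
    """Run func(url) in a separate process and return its result."""
    parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_worker,
                                   args=(child_conn, func, url))
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    result = parent_conn.recv()
    proc.join()  # the child's memory is released when it exits
    return result


if __name__ == '__main__':
    print(scrape_in_child(fake_scrape, 'http://example.com/'))
```

The trade-off is the cost of starting a process per URL and of pickling the result, in exchange for a parent process whose memory use stays flat.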
Thanks, Ivan