
I have a Python script that scrapes some URLs. I have a list of URLs and for each URL I fetch the HTML and do some logic with it.

I use Python 2.7.6 and Linux Mint 17 Cinnamon 64-bit.

The problem is that my main scraping object, which I instantiate for every URL, is never released from memory although there is no reference to it. Because of that, memory usage just keeps growing rapidly (my object is sometimes very big - up to 50 MB).

Simplified, the code looks something like this:

def scrape_url(url):
    """
    Simple helper method for scraping url
    :param url: url for scraping
    :return: some result
    """
    scraper = Scraper(url)  # instantiate the main Scraper object
    result = scraper.scrape()  # scrape it

    return result

## SCRIPT STARTS HERE
import resource

urls = get_urls()  # fetch some list of urls

for url in urls:
    print 'MEMORY USAGE BEFORE SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = scrape_url(url)  # call helper method for scraping
    print 'MEMORY USAGE AFTER SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print '-' * 50

My output is something like this:

MEMORY USAGE BEFORE SCRAPE: 75732 (kb)
MEMORY USAGE AFTER SCRAPE: 137392 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 137392 (kb)
MEMORY USAGE AFTER SCRAPE: 206748 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 206748 (kb)
MEMORY USAGE AFTER SCRAPE: 284348 (kb)
--------------------------------------------------

The Scraper object is big and it is not released from memory. I tried:

scraper = None

del scraper

or even calling the garbage collector explicitly with:

gc.collect()

but nothing helped.
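In context, that attempt looked roughly like this (inside the helper, before returning the result):

import gc

def scrape_url(url):
    scraper = Scraper(url)
    result = scraper.scrape()

    del scraper   # drop the only reference I am aware of
    gc.collect()  # force a full collection pass

    return result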

When I print the number of references to the scraper object with:

print sys.getrefcount(scraper)

I get 2, which I think means there are no other references to the object and it should be cleaned up by the gc.
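For comparison, even a brand-new object with a single name bound to it reports 2 here, because the temporary reference created for getrefcount's own argument is counted too:

import sys

x = object()              # exactly one reference: the name x
print sys.getrefcount(x)  # prints 2 - the call's argument adds one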

The Scraper object has lots of sub-objects. Is it possible that a reference to one of its sub-objects is kept somewhere, so the gc cannot release the main Scraper object? Or is there some other reason why Python doesn't release the memory?

I found some topics about this on SO, and in some of the responses people say that memory cannot be released back to the OS unless you spawn/kill child processes, which sounds really strange (LINK).
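If I understood those answers correctly, the idea is roughly the following (a sketch only, assuming scrape_url and its result can be pickled):

from multiprocessing import Pool

# each task runs in a worker process; with maxtasksperchild=1 the worker
# is replaced after every URL, so its memory is returned to the OS on exit
pool = Pool(processes=1, maxtasksperchild=1)
results = pool.map(scrape_url, urls)
pool.close()
pool.join()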

Thanks, Ivan

    "Scraper object has lots of subobjects... doesn't release memory?" that would be the only plausible reason. scrape url established a connection on a port I assume? Probably that connection holds the standing reference. – user2255757 Feb 11 '16 at 15:16
  • are you sure that result is not connected with scrapper? – Jerzyk Jul 01 '16 at 09:36

1 Answer


You are iterating over a list of URLs, which has to be in memory at all times. Rewrite your loop to use a generator and scrape lazily. Something along the lines of:

def gen():
    for url in urls:
        yield url
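and then consume the URLs lazily in the loop, for example:

for url in gen():
    result = scrape_url(url)  # same helper as before, one URL at a time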