
Basically, I pull a series of links from my database, and want to scrape them for specific links I'm looking for. I then re-feed those links into my link queue that my multiple QWebViews reference, and they continue to pull those down for processing/storage.

My issue is that as this runs for... say 200 or 500 links, it starts to use up more and more RAM.

I have looked into this exhaustively, using heapy, memory_profiler, and objgraph to figure out what's causing the memory leak. The Python heap's objects stay about the same in both count and size over time. This made me think the C++ objects weren't getting freed. Sure enough, using memory_profiler, the RAM only goes up when the self.load(self.url) lines of code are called. I've tried to fix this, but to no avail.

Code:

from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView, QWebSettings
from PyQt4.QtGui import QApplication
from lxml.etree import HTMLParser

# My functions
from util import dump_list2queue, parse_doc

class ThreadFlag:
    def __init__(self, threads, jid, db):
        self.threads = threads
        self.job_id = jid
        self.db_direct = db
        self.xml_parser = HTMLParser()

class WebView(QWebView):
    def __init__(self, thread_flag, id_no):
        super(WebView, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)
        self.settings().globalSettings().setAttribute(QWebSettings.AutoLoadImages, False)
        # This is actually a dict with a few additional details about the url we want to pull
        self.url = None
        # doing one instance of this to avoid memory leaks
        self.qurl = QUrl()
        # id of the webview instance
        self.id = id_no
        # Status of this webview instance: GREEN means idle, YELLOW means busy.
        self.status = 'GREEN'
        # Reference to a single universal object all the webview instances can see.
        self.thread_flag = thread_flag

    def handleLoadFinished(self):
        try:
            self.processCurrentPage()
        except Exception as e:
            print e

        self.status = 'GREEN'

        if not self.fetchNext():
            # We're finished!
            self.loadFinished.disconnect()
            self.stop()
        else:
            # We're not finished! Do next url.
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

    def processCurrentPage(self):
        self.frame = str(self.page().mainFrame().toHtml().toUtf8())

        # This is the case for the initial web pages I want to gather links from.
        if 'name' in self.url:
            # Parse html string for links I'm looking for.
            new_links = parse_doc(self.thread_flag.xml_parser, self.url, self.frame)
            if len(new_links) == 0: return 0
            fkid = self.url['pkid']
            new_links = map(lambda x: (fkid, x['title'],x['url'], self.thread_flag.job_id), new_links)


            # Post links to database, db de-dupes and then repull ones that made it.
            self.thread_flag.db_direct.post_links(new_links)
            added_links = self.thread_flag.db_direct.get_links(self.thread_flag.job_id,fkid)

            # Add the pulled links to central queue all the qwebviews pull from
            dump_list2queue(added_links, self._urls)
            del added_links
        else:
            # Process one of the links I pulled from the initial set of data that was originally in the queue.
            print "Processing target link!"

    # Get next url from the universal queue!
    def fetchNext(self):
        if self._urls and self._urls.empty():
            self.status = 'GREEN'
            return False
        else:
            self.status = 'YELLOW'
            self.url = self._urls.get()
            return True

    def start(self, urls):
        # This is where the reference to the universal queue gets made.
        self._urls = urls
        if self.fetchNext():
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

# uq = central url queue shared between webview instances
# ta = array of webview objects
# tf - thread flag (basically just a custom universal object that all the webviews can access).

# This main "program" is started by another script elsewhere.
def main_program(uq, ta, tf):

    app = QApplication([])
    webviews = ta
    threadflag = tf

    tf.app = app

    print "Beginning the multiple async web calls..."

    # Create n "threads" (really just webviews) that each will make asynchronous calls.
    for n in range(0,threadflag.threads):
        webviews.append(WebView(threadflag, n+1))
        webviews[n].start(uq)

    app.exec_()
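One hazard worth noting in fetchNext above: when several webviews share the same queue, the empty() check followed by get() is a check-then-act race, and get() can block forever if another webview drains the queue in between. A minimal non-blocking sketch (the fetch_next helper name is mine, not from the original code):

```python
try:
    from queue import Queue, Empty   # Python 3
except ImportError:
    from Queue import Queue, Empty   # Python 2

def fetch_next(urls):
    # Take the next url atomically; return None when the queue is empty.
    # This avoids the race between urls.empty() and urls.get().
    try:
        return urls.get_nowait()
    except Empty:
        return None

q = Queue()
q.put({'url': 'http://example.com'})
print(fetch_next(q))   # first call returns the queued dict
print(fetch_next(q))   # queue is now empty, so this returns None
```

In the WebView class this would replace the empty()/get() pair, mapping a None return to `self.status = 'GREEN'` and `return False`.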

Here's what my memory tools say (they all stay roughly constant through the whole program):

  1. RAM: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

     2491 (MB)
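(A caveat on that RAM figure: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, so the /1024 conversion only yields megabytes on Linux. A small self-contained helper; the peak_rss_mb name is mine, not from the question:)

```python
import resource
import sys

def peak_rss_mb():
    # ru_maxrss is in kilobytes on Linux, but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        rss //= 1024
    return rss / 1024.0  # megabytes

print(peak_rss_mb())
```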

  2. Objgraph most common types:

         methoddescriptor            9959
         function                    8342
         weakref                     6440
         tuple                       6418
         dict                        4982
         wrapper_descriptor          4380
         getset_descriptor           2314
         list                        1890
         method_descriptor           1445
         builtin_function_or_method  1298

  3. Heapy:

         Partition of a set of 9879 objects. Total size = 1510000 bytes.

         Index  Count   %     Size   %  Cumulative   %  Kind (class / dict of class)
             0   2646  27   445216  29      445216  29  str
             1    563   6   262088  17      707304  47  dict (no owner)
             2   2267  23   199496  13      906800  60  __builtin__.weakref
             3   2381  24   179128  12     1085928  72  tuple
             4    212   2   107744   7     1193672  79  dict of guppy.etc.Glue.Interface
             5     50   1    52400   3     1246072  83  dict of guppy.etc.Glue.Share
             6    121   1    40200   3     1286272  85  list
             7    116   1    32480   2     1318752  87  dict of guppy.etc.Glue.Owner
             8    240   2    30720   2     1349472  89  types.CodeType
             9     42   0    24816   2     1374288  91  dict of class
  • Qt4 is obsolete - official support ended over two years ago. There is no chance of problems like this ever being fixed. You need to switch to PyQt5 and use web-engine. – ekhumoro Jul 08 '18 at 09:57
  • I'm worried this is fundamental to Qt.. and Qt5 will have the same problems. But you're right. I'll switch soon, I'm pretty tired of trying to fix this. – crimson_caesar Jul 10 '18 at 06:48
  • Well, you have a test case - all you need to do is port it to pyqt5 and run it. Of course, there *still* might be a problem, but at least there is a much better chance that it will be fixed at some point. You should also note that the next release of pyqt4 - presumably 4.12.2, or perhaps 4.13 - will be the ***last one***. So if you *can* switch to pyqt5, you should do it sooner rather than later, regardless of the issues with web-view. – ekhumoro Jul 10 '18 at 10:40
  • I see that your example seems to have been partly based on [some code that I wrote](https://stackoverflow.com/a/21294180/984421). However, you have made some changes to it that may be the cause of the problem. In particular, you are using the `QWebView` class instead of `QWebPage`, and you are keeping references to multiple instances of it in a list. I assume this is because you want to process the urls in parallel. The obvious question, therefore, is this: if you use only ***one*** instance to process all the urls, is the memory usage the same? – ekhumoro Jul 10 '18 at 11:05

1 Answer


Your program is indeed growing because of C++ allocations, but it is not an actual leak in the sense of objects that are no longer referenced. What is happening, at least in part, is that your QWebView holds a QWebPage, which holds a QWebHistory. Each time you call self.load, the history gets a bit longer.

Note that QWebHistory has a clear() function.
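A sketch of where the clear could go, assuming the handleLoadFinished method from the question; this is untested here, since it needs a live PyQt4/Qt build:

```python
def handleLoadFinished(self):
    try:
        self.processCurrentPage()
    except Exception as e:
        print e

    # Each successful load() appends an entry to the page's QWebHistory;
    # clearing it here keeps the history from growing without bound.
    self.history().clear()

    self.status = 'GREEN'
    if not self.fetchNext():
        self.loadFinished.disconnect()
        self.stop()
    else:
        self.qurl.setUrl(self.url['url'])
        self.load(self.qurl)
```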

Documentation is available: http://pyqt.sourceforge.net/Docs/PyQt4/qwebview.html#history

– Tim Boddy
  • Hey I tried adding this to my code. I think it helped a little? (clearing after the page gets processed), however, the RAM keeps building. When I add more urls, it can get as high as 6 gigs. :( I can't help but think there's got to be something else getting built in the background. – crimson_caesar Jul 10 '18 at 06:47
  • Can you show the exact lines that you added? Also, I would encourage you to follow the suggestion from ekhumoro because it would not be good to track down an issue that turns out to be in PyQt4 code that is no longer supported. If you are interested in trying https://github.com/vmware/chap on a live core from your process, I can help you figure out where the 6 gigs come from, but again you are probably better off following ekhumoro's suggestion in the very near future. – Tim Boddy Jul 10 '18 at 09:48