12

I'm writing a spider in Python, using the lxml library to parse HTML and the gevent library for async I/O. I found that after running for a while, the lxml parser starts eating memory, up to 8 GB (all of the server's memory). But I only have 100 async threads, and each of them parses documents of at most 300 KB.

I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce the problem.

The problem is in this line of code:

HTML = lxml.html.fromstring(htmltext)
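
Roughly, the setup looks like this (a simplified sketch - the URL list and fetching details are placeholders, not my real code):

    import gevent.monkey
    gevent.monkey.patch_all()          # so urllib2 sockets cooperate with gevent

    import urllib2
    import lxml.html
    from gevent.pool import Pool

    pool = Pool(100)                   # 100 concurrent greenlets, as described above

    def parse(url):
        htmltext = urllib2.urlopen(url).read()   # documents are at most ~300 KB
        HTML = lxml.html.fromstring(htmltext)
        return HTML.findtext('.//title')

    URLS = []                          # placeholder: the real URL queue goes here
    for url in URLS:
        pool.spawn(parse, url)
    pool.join()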

Does anyone know what this could be, or how to fix it?

Thanks for the help.

P.S.

Linux Debian-50-lenny-64-LAMP 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64    GNU/Linux
Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Update:

I set ulimit -Sv 500000 and ulimit -Sm 615000 for the processes that use the lxml parser.

And now, after some time, they start writing this to the error log:

"Exception MemoryError: MemoryError() in 'lxml.etree._BaseErrorLog._receive' ignored".

And I can't catch this exception, so it keeps writing this message to the log over and over until the disk runs out of free space.

How can I catch this exception and kill the process, so the daemon can create a new one?
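
One idea I'm considering is a separate watchdog process that applies the limit and respawns the worker whenever it dies - a rough, untested sketch (run_spider() is a placeholder for the real parsing loop):

    import resource
    import multiprocessing

    MEM_LIMIT = 500 * 1024 * 1024      # ~500 MB address-space cap, like the ulimit above

    def worker():
        # enforce the limit inside the child, then run the parsing loop
        resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
        run_spider()                   # placeholder for the real gevent/lxml loop

    if __name__ == '__main__':
        while True:
            p = multiprocessing.Process(target=worker)
            p.start()
            p.join()                   # returns once the child dies (MemoryError, OOM kill, ...)
            # loop around and spawn a fresh worker with a clean heap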

Andrey Nikishaev
  • Can't you just run it in a virtual machine with 4 GB memory? –  Mar 10 '11 at 14:01
  • I can create a monitor process that will restart the main process if it starts eating memory. But this is not an answer, because the bug remains. – Andrey Nikishaev Mar 12 '11 at 06:27
  • http://mailman-mail5.webfaction.com/listinfo/lxml – John Machin Mar 18 '11 at 19:01
  • Did you ever fix this? I'm stuck with the same problem - extremely frustrating! Perhaps the reason nobody's picked up the bug is that it runs quite normally for about ten minutes, then BOOM, 100% RAM. – knutole Feb 22 '13 at 23:39
  • I tried to post an issue to the lxml guys, but they rejected it. For more details you can look here: https://bugs.launchpad.net/lxml/+bug/728924 – Andrey Nikishaev Mar 05 '13 at 11:14
  • Same here. In my case it is fine for a few HTML documents, and then for some of them it jumps by 100MB or so and never goes down, even though I reuse the parsed variable (in theory it should be garbage collected). – Ivan Longin Feb 22 '16 at 13:56

3 Answers

7

You might be keeping some references that keep the documents alive. Be careful with string results from XPath evaluation, for example: by default they are "smart" strings, which provide access to the containing element and thus keep the tree in memory if you keep a reference to them. See the docs on XPath return values:

There are certain cases where the smart string behaviour is undesirable. For example, it means that the tree will be kept alive by the string, which may have a considerable memory impact in the case that the string value is the only thing in the tree that is actually of interest. For these cases, you can deactivate the parental relationship using the keyword argument smart_strings.

(I have no idea if this is the problem in your case, but it's a candidate. I've been bitten by this myself once ;-))
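
For illustration, a small sketch (not your code) of how a smart string keeps the tree alive, and how smart_strings=False avoids that:

    import lxml.html

    doc = lxml.html.fromstring("<html><body><p>hello</p></body></html>")

    # Default: the result is a "smart" string that still references its element,
    # and through it the whole tree.
    smart = doc.xpath("//p/text()")[0]
    print smart.getparent().tag        # 'p' -- the tree is still reachable

    # With smart_strings=False you get a plain string, so once you drop your
    # own reference to the tree it can actually be garbage collected.
    plain = doc.xpath("//p/text()", smart_strings=False)[0]
    print type(plain)                  # ordinary unicode/str, no getparent()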

Steven
  • I don't use xpath. My problem is in this line of code: HTML = fromstring(htmlcode). While I tested this bug, I removed everything else from the script. There are also no kept references. – Andrey Nikishaev Mar 10 '11 at 14:01
  • Hi Steven, I just encountered the lxml memory problem, and using smart_strings=False solved it without changing any other code. But what I'm curious about is why the Python garbage collector didn't give the memory back to the system? I didn't use the tree anymore. – kuafu Feb 01 '13 at 07:55
  • @young001: you probably still used the strings, and because they were "smart strings", they still used the tree (by keeping a reference to their element...) meaning that as long as you keep a reference to such a "smart string", this will prevent the tree from being garbage collected. – Steven Feb 01 '13 at 09:43
  • I use it in a function and return the strings, then store them in a db, so what situation would count as "still using the strings"? The strings should be collected when the function ends. There are no global variables in the function. – kuafu Feb 01 '13 at 10:54
  • @young001: if you return such a "smart string" from a function, the return value will still be a "smart string" and will still have references to the tree, so as long as you keep a reference to that return value... Just returning it doesn't convert it to a normal string (you could do `return str(x)` of course to force that). – Steven Feb 01 '13 at 14:53
  • what about just `del`eting the object when you're done with it? – knutole Feb 12 '13 at 19:58
  • although it's weird - because the memory use doesn't grow, it just explodes at some point. – knutole Feb 12 '13 at 20:01
1

There is an excellent article at http://www.lshift.net/blog/2008/11/14/tracing-python-memory-leaks which demonstrates graphical debugging of memory structures; this might help you figure out what's not being released and why.

Edit: I found the article from which I got that link - Python memory leaks
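
As a rough sketch of that kind of approach (objgraph is just my suggestion here, assuming it's installed; the PNG output also needs graphviz):

    import gc
    import objgraph                    # pip install objgraph

    gc.collect()
    # which object types are most numerous right now?
    objgraph.show_most_common_types(limit=10)

    # pick a suspect object and draw the reference chain that keeps it alive
    suspects = objgraph.by_type('HtmlElement')
    if suspects:
        objgraph.show_backrefs(suspects[:1], max_depth=5, filename='backrefs.png')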

Hugh Bothwell
0

It seems the issue stems from the library that lxml relies on: libxml2, which is written in C. Here is the first report: http://codespeak.net/pipermail/lxml-dev/2010-December/005784.html. This bug is not mentioned in either the lxml v2.3 bug-fix log or the libxml2 changelog.

Oh, there are follow-up mails here: https://bugs.launchpad.net/lxml/+bug/728924

Well, I tried to reproduce the issue but got nothing abnormal. Those who can reproduce it may be able to help clarify the problem.

Walden Lake
  • I tried to reproduce it for about two weeks, but got nothing. It seems to be a schroedinbug. – Andrey Nikishaev Nov 25 '11 at 16:59
  • Just to clarify, this bug - at least for me - is not an increase in RAM use (like a normal memory leak). It runs fine, completely normal, but then suddenly, after say ten minutes (which is perhaps 15-2000 objects), my RAM is completely flooded and stays like that until the process is killed. So it's not a "normal" (or at least transparent) memory leak. FYI. – knutole Feb 23 '13 at 00:06