
The program is quite simple: it recursively descends into directories and extracts an element from each file. There are about 1k directories, each with roughly 200 files of about 0.5 MB. After some time the script consumes about 2.5 GB of memory, which is completely unacceptable, as it is not the only process on the machine and it ends up eating everything. I cannot understand why it doesn't release the memory; an explicit `del` doesn't help. Are there any techniques to consider?


from lxml import etree
import os

# basedir and parser are defined earlier: the top-level directory and an lxml parser
res = set()
for root, dirs, files in os.walk(basedir):
    for i in files:
        tree = etree.parse(os.path.join(root, i), parser)
        for i in tree.xpath("//a[@class='ctitle']/@href"):
            res.add(i)
        del tree
aikipooh
  • What type is `i`? What are you going to do with `res`? – Peter Wood Sep 21 '15 at 10:09
  • Oh, well, different i here:) No big problem, but not elegant. Outer i is string, inner i is lxml.etree._ElementUnicodeResult which I believe is converted to a string easily. Could the memory be cluttered with those elements, you mean? I don't delete them, yes. I'll check. For now I store the res (many duplicates, so it's about 100 results per directory). – aikipooh Sep 21 '15 at 10:26
  • How are you measuring memory consumption? – Lukas Graf Sep 21 '15 at 10:27
  • @LukasGraf: With top. This process constantly increases its memory footprint… And yes, organoleptically too, when everything else is being swapped out:) – aikipooh Sep 21 '15 at 10:29
  • So what are you looking at? VSZ (virtual set size) or RSS (resident set size) (resp VIRT/RES in top)? If it's just VSZ that's high, that isn't necessarily an issue. – Lukas Graf Sep 21 '15 at 10:30
  • @LukasGraf: After 10 directories: KiB Mem: 4049356 total, 4018984 used, 30372 free, 2488 buffers; KiB Swap: 12582904 total, 1570596 used, 11012308 free, 52256 cached. For the process: PID 14211, USER pooh, PR 20, NI 0, VIRT 1220464, RES 1.083g, SHR 3104, S R, %CPU 90.8, %MEM 28.1, TIME+ 1:22.79, COMMAND python3.4 – aikipooh Sep 21 '15 at 10:36
  • The `lxml.etree._ElementUnicodeResult` objects themselves are probably not using that much memory, but since you can do `.getparent()` on them, they keep a reference to the tree, which means the tree can't be garbage collected by Python. So from what I see, turning them into strings before adding them to your set *should* help the garbage collector do its job (see the short sketch after these comments). – Lukas Graf Sep 21 '15 at 10:38
  • @Pooh: this is a known and documented behaviour, cf. http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm and the linked post. – bruno desthuilliers Sep 21 '15 at 10:39
  • Err, actually I failed to spot the problem mentioned by Lukas Graf - reopening the question since it might well have something to do with this issue too. – bruno desthuilliers Sep 21 '15 at 10:40
  • @brunodesthuilliers still a good point, I was thinking about that behavior too, but I'm not sure it necessarily applies in this case - OP shouldn't have huge amounts of values of primitive types in memory *at the same time*. Hard to say for sure though. – Lukas Graf Sep 21 '15 at 10:43
  • @LukasGraf: Thank you, it solved the issue. Stable at 0.7%. So well, if you need upvote, you can answer the question:) – aikipooh Sep 21 '15 at 10:45
  • Great. Memory optimization is rarely that easy, glad it worked ;-) – Lukas Graf Sep 21 '15 at 10:46
  • @LukasGraf: I'm very new to Python, so there's still much to learn regarding the differences between the Pythonic way and my ingrained C approach:) – aikipooh Sep 21 '15 at 10:49
  • @LukasGraf: Argh! I was reloading the page, but for some reason Peter's answer wasn't displayed:( Now it is, and so was yours. I felt I ought to give precedence to him, as he was the first to react and to bring up the issue of storing elements in a list (which you then elaborated on). And now I see that you have removed your answer. Once again I'm disenchanted by SO, which I once spent too much time on. – aikipooh Sep 21 '15 at 10:53
  • @Pooh he deleted it intermittently, undeleted and edited it. It's fine, his post answers the issue just as well, so I removed mine. – Lukas Graf Sep 21 '15 at 10:57
  • @LukasGraf I didn't mean to deceive, that's just how it came about (c: I probably should have started a new answer and didn't think of the consequences. Pooh, please don't be disenchanted. This is a great site, with some really helpful, knowledgeable, and genuinely nice people. I wouldn't want my answering your question to make you disenchanted. What would make you happier? I answer to be helpful and to make people happy, not for upvotes (although they make me feel appreciated, that's not what drives me). – Peter Wood Sep 21 '15 at 11:58
  • @PeterWood: Oh, thank you for replying:) No, SO won't make me any happier; people are good, but having no control over the question (for example, I asked something on SO and it was transferred to Unix) is what made me stop using it. So now I use it only when I'm totally stuck. This time I was disenchanted because I was reloading the page and your answer didn't show up. Now I see it wasn't a bug in SO, and that's good:) – aikipooh Sep 21 '15 at 13:04
  • @PeterWood no worries, I figured as much. It's all good ;-) – Lukas Graf Sep 21 '15 at 18:04
  • @LukasGraf thank you thank you thank you :-) saved my day – Jabb Jun 20 '18 at 20:49
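
To illustrate what the comments above describe, here is a minimal sketch (the markup is made up for the example): the value an lxml XPath query returns is a "smart string" that can still reach its parent element, and through it the whole tree, whereas a plain `str` copy cannot.

from lxml import etree

# A tiny made-up document, just to show the reference behaviour.
doc = etree.fromstring('<div><a class="ctitle" href="/article/1">title</a></div>')

href = doc.xpath("//a[@class='ctitle']/@href")[0]
print(type(href))            # <class 'lxml.etree._ElementUnicodeResult'>
print(href.getparent().tag)  # 'a' -- the result still points back into the tree
print(type(str(href)))       # <class 'str'> -- a plain copy with no back-reference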

1 Answer


You're keeping references to elements from the tree: each XPath result is an `_ElementUnicodeResult`, and it keeps a reference to its parent element. This prevents the whole tree from being garbage collected.

Try converting each element to a string and storing that instead:

from lxml import etree
import os

# basedir and parser are assumed to be defined as in the question
titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        for title in tree.xpath("//a[@class='ctitle']/@href"):
            # str() copies the value, dropping the reference back into the tree
            titles.add(str(title))
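
A related option, if you'd rather not create these "smart strings" at all: lxml's compiled XPath expressions accept `smart_strings=False`, in which case the results are plain strings with no link back to the tree. A rough sketch along the same lines (`basedir` and `parser` as in the question):

from lxml import etree
import os

# Compile the expression once, with smart strings disabled: the results are
# plain strings that hold no reference back into the parsed tree.
find_hrefs = etree.XPath("//a[@class='ctitle']/@href", smart_strings=False)

titles = set()
for root, dirs, files in os.walk(basedir):
    for filename in files:
        tree = etree.parse(os.path.join(root, filename), parser)
        titles.update(find_hrefs(tree))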
Peter Wood