How does Python lxml iteration handle tag text?

Question

I have the following code:

from lxml import etree
from io import StringIO

html = """"Hello, world!"<span class="black">
<div class="c1">division
    <p>"Hello - this is me.
    (c) passage in division"
    <b>"bold in passage "</b>
    </p>
        My phone:
    (+7) 999-999-99-99
</div>
<!-- Comment -->
<pre>It's a pre.</pre>
"""

def parse_HTML(html):
    parser = etree.HTMLParser()
    root = etree.parse(StringIO(html), parser)

    # iter() replaces the deprecated getiterator()
    for elem in root.iter():
        # skip comments: their tag is the etree.Comment function, not a str
        if not isinstance(elem.tag, str):
            continue
        # elem.text is None when the element has no leading text
        text = elem.text or ''

        print(elem.tag + " => " + text)

if __name__ == "__main__":
    parse_HTML(html)

Output:

html =>
body =>
p => "Hello, world!"
span =>

div => division

p => "Hello - this is me.
    (c) passage in division"

b => "bold in passage "
<class 'cython_function_or_method'>
pre => It's a pre.

Question: Why does the string " My phone: (+7) 999-999-99-99" not appear in the output?

It doesn't display it because it's not enclosed in tags (e.g. `...`) — l'L'l, May 02 '18 at 06:54
You have mixed mode XML. The `<div>` has text intermixed with sub-elements. `elem.text` only gives the text up to the first sub-element. `elem.text + (elem.tail or '')` also gets the tail, or remaining text. It's explained in [elements contain text](http://lxml.de/tutorial.html#elements-contain-text) — tdelaney, May 02 '18 at 07:12
Thank you very much, tdelaney! I don't believe how fast i got my answer here. — Ramil Yabbarov, May 02 '18 at 07:16
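
To illustrate the comment above, here is a minimal sketch of how `text` and `tail` split the content of a mixed-content element. The markup is a simplified version of the one in the question, and `SNIPPET` and `collect_text` are hypothetical names introduced only for this example. Whitespace is collapsed so the result is easier to read.

```python
from lxml import etree

# Simplified version of the markup from the question (an assumption,
# not the asker's exact input).
SNIPPET = """<div class="c1">division
    <p>passage in division <b>bold in passage</b></p>
        My phone:
    (+7) 999-999-99-99
</div>"""


def collect_text(markup):
    """Return (tag, text, tail) for every real element, whitespace collapsed."""
    root = etree.HTML(markup)  # parses with lxml's HTML parser
    rows = []
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag.
        if not isinstance(elem.tag, str):
            continue
        # text: content between the opening tag and the first child.
        text = " ".join((elem.text or "").split())
        # tail: content between this element's closing tag and the next tag.
        tail = " ".join((elem.tail or "").split())
        rows.append((elem.tag, text, tail))
    return rows


for tag, text, tail in collect_text(SNIPPET):
    print(f"{tag} => text={text!r} tail={tail!r}")
```

In this sketch the phone number shows up as the `tail` of the `p` element rather than as the `text` of the `div`, which is why printing only `elem.text` misses it. If only the concatenated text is needed, `elem.itertext()` yields the text and tail pieces in document order.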
