
I have the following code:

from lxml import etree
from io import StringIO

html = """"Hello, world!"<span class="black">
<div class="c1">division
    <p>"Hello - this is me.
    (c) passage in division"
    <b>"bold in passage "</b>
    </p>
        My phone:
    (+7) 999-999-99-99
</div>
<!-- Comment -->
<pre>It's a pre.</pre>
"""

def parse_HTML(html):
    parser = etree.HTMLParser()
    root = etree.parse(StringIO(html), parser)

    for elem in root.iter():
        # skip comments: their tag is a function (cython_function_or_method), not a str
        if type(elem.tag) is not str:
            continue
        if elem.text is None:
            text = ''
        else:
            text = elem.text

        print(str(elem.tag) + " => " + text)

if __name__ == "__main__":
    parse_HTML(html)

Output:

html =>
body =>
p => "Hello, world!"
span =>

div => division

p => "Hello - this is me.
    (c) passage in division"

b => "bold in passage "
<class 'cython_function_or_method'>
pre => It's a pre.

Question: Why does the string "My phone: (+7) 999-999-99-99" not appear in the output?

  • It doesn't display it because it's not enclosed in tags (e.g. `...`).
    – l'L'l May 02 '18 at 06:54
  • You have mixed mode XML. The `<div>` has text intermixed with sub-elements. `elem.text` only gives the text up to the first sub-element. `elem.text + (elem.tail or '')` gets the tail, or remaining text. It's explained in [elements contain text](http://lxml.de/tutorial.html#elements-contain-text)
    – tdelaney May 02 '18 at 07:12
  • Thank you very much, tdelaney! I can't believe how fast I got my answer here. – Ramil Yabbarov May 02 '18 at 07:16
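
To illustrate the text/tail point from the comments, here is a minimal sketch on a simplified, well-formed version of the question's markup. The `snippet` string and the `text_and_tail` helper are made up for this example, and the fallback import assumes the standard library's `xml.etree.ElementTree` is acceptable when lxml is not installed (both expose `.text`, `.tail`, `iter()` and `itertext()`). Note that the phone number is stored as the tail of the `p` element, not as text of the `div`.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library exposes the same text/tail model.
    import xml.etree.ElementTree as etree

# Simplified, well-formed version of the markup from the question (assumption).
snippet = (
    '<div class="c1">division'
    '<p>passage in division<b>bold in passage</b></p>'
    'My phone: (+7) 999-999-99-99'
    '</div>'
)


def text_and_tail(root):
    """Return (tag, text, tail) for every element, with None mapped to ''."""
    rows = []
    for elem in root.iter():
        # Skip comments and processing instructions, whose tag is not a string.
        if not isinstance(elem.tag, str):
            continue
        rows.append((elem.tag, elem.text or '', elem.tail or ''))
    return rows


root = etree.fromstring(snippet)
rows = text_and_tail(root)
for tag, text, tail in rows:
    print(f"{tag} => text={text!r} tail={tail!r}")

# All text under the div, in document order, tails included.
full_text = ''.join(root.itertext())
print(full_text)
```

In the question's loop, printing `elem.tail` alongside `elem.text` should surface the missing string on the `p` line; `itertext()` is an alternative when only the concatenated text is needed.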

0 Answers