Python parsing html with lxml: get text of tag while specific sign causes problems

Question

I'm parsing real-world HTML files with lxml. That is, I want to extract information from tags, and I have no control over the markup style. The problem I'm having lies within the data.

<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h) 
</fieldset>

The problem is the < sign within the data: lxml's HTML parser skips the text and the end tag, but this is exactly the text I want to extract. Is there any way to get the text out of this tag?

score 1 · Answer 1 · edited May 23 '17 at 10:27

The HTML is actually broken: a literal < in text content should be escaped as &lt;.

You can still parse it as is with BeautifulSoup and the lenient html5lib parser:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


data = u"""
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

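# html5lib follows the HTML5 parsing rules, so the stray "<" is kept as text.
soup = BeautifulSoup(data, "html5lib")
# The wanted text is the node right after <legend> inside <fieldset>.
print(soup.fieldset.legend.next_sibling.strip())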

Prints:

Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
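
If you would rather stay with lxml, one possible workaround is to escape stray < characters before parsing. The following is only a minimal sketch, not part of the original answer: the helper name escape_stray_lt and the rule "a < that is not followed by a letter, /, ! or ? is text" are assumptions made for illustration.

```python
import re

# Assumption: a "<" only starts real markup when it is followed by a letter,
# "/", "!" or "?". Anything else is treated as stray text and escaped.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Return html with stray "<" characters replaced by "&lt;"."""
    return _STRAY_LT.sub("&lt;", html)


data = """
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

# Only the "<" before " IE" is rewritten; the real tags are left untouched.
cleaned = escape_stray_lt(data)
print(cleaned)
```

The cleaned string can then be handed to lxml, for example with lxml.html.fromstring(cleaned). Note that this heuristic will miss a stray < that happens to be followed directly by a letter (such as "a <b"), so it is a patch for this kind of data rather than a general fix.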

answered Nov 18 '15 at 17:56

alecxe

Thanks for pointing that out. I thought about using Beautiful Soup too. Is the HTML really broken, or is it just a character that the HTML parser cannot parse? – IssnKissn Nov 19 '15 at 09:46
