I'm parsing real-world HTML files with lxml. That is, I want to extract information from tags, and I have no control over how the markup is written. The problem I'm having lies in the data itself.

<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h) 
</fieldset>

The problem is the < sign within the data: lxml's HTML parser skips the text and the end tag, but this is exactly the text I want to extract. Is there any way to get the text out of this tag?

mzjn
IssnKissn

1 Answer

The HTML is actually broken: a literal < in text content should be escaped as &lt;.

You can, however, parse it as-is with BeautifulSoup and the lenient html5lib parser:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


data = u"""
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

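# html5lib recovers from the stray "<" and keeps it as text
soup = BeautifulSoup(data, "html5lib")
# The wanted text is the node that directly follows </legend>
print(soup.fieldset.legend.next_sibling.strip())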

Prints:

Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
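
If you would rather stay with lxml, another option is to repair the markup before parsing it. The following is only a rough sketch, not a general HTML fixer: it assumes that any < which is not followed by a letter, "/", "!" or "?" cannot start a tag and is therefore literal text. The names STRAY_LT and escape_stray_lt are made up for this example.

```python
import re

# Heuristic (an assumption, not a full HTML tokenizer): a "<" that is not
# followed by a letter, "/", "!" or "?" cannot open a tag, an end tag,
# a comment/doctype or a processing instruction, so treat it as text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Replace stray '<' characters with the entity '&lt;'."""
    return STRAY_LT.sub("&lt;", markup)


data = """
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

cleaned = escape_stray_lt(data)
print(cleaned)
```

The cleaned string can then be handed to lxml.html.fromstring, where the text after the legend should be available as the tail of the legend element. Note that the heuristic will not catch a stray < that is immediately followed by a letter (for example a<b), so it is worth checking against your real data.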
Community
alecxe
  • Thanks for pointing that out. I thought about using Beautiful Soup too. Is the HTML really broken, or is it just a character that the HTML parser cannot handle? – IssnKissn Nov 19 '15 at 09:46