I'm parsing real-world HTML files with lxml. That is, I want to extract information from tags, and I have no control over how the markup is written. The problem I'm having lies in the data itself:
<fieldset>
<legend>
<strong>Notes</strong>
</legend>
Slav *kǫda 'thither', kǫdě 'where, whither' < IE *k(w)om-d(h)
</fieldset>
The problem is caused by the < sign within the data: lxml's HTML parser skips the text and the end tag, but this is exactly the text I want to extract. Is there any way to get the text out of this tag?
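One possible workaround, sketched here under assumptions rather than offered as a definitive fix, is to escape the stray < before the markup ever reaches the parser. The helper name `escape_stray_lt` and the pattern are hypothetical. The pattern assumes that a < not followed by a letter, "/", "!" or "?" can never start a tag, so it must be text. It will not catch a bare < that is immediately followed by a letter (for example "a<b").

```python
import re

# Assumption: a "<" that is not followed by a letter, "/", "!" or "?"
# cannot open a tag, end tag, comment, doctype or processing instruction,
# so it is treated as literal text.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace every "<" that cannot open a tag with the entity "&lt;"."""
    return _STRAY_LT.sub("&lt;", markup)


# The fieldset from the question, as a single string.
SAMPLE = (
    "<fieldset>\n"
    "<legend>\n"
    "<strong>Notes</strong>\n"
    "</legend>\n"
    "Slav *kǫda 'thither', kǫdě 'where, whither' < IE *k(w)om-d(h)\n"
    "</fieldset>"
)

escaped = escape_stray_lt(SAMPLE)
print(escaped)
```

The escaped string can then be passed to `lxml.html.fromstring`, and the text of the fieldset read with `text_content()`. The parser should decode `&lt;` back to < in the extracted text, though the exact handling of malformed input may differ between libxml2 versions.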