I need to work with a page that has an unfortunate mix of correct and incorrect HTML entities; for instance:
<i>Kristj&aacuten V&iacute;ctor</i>
Firefox 67 does, in the end, interpret this correctly and renders the name as Kristján Víctor.
... however, if we do "View Source", Firefox indicates via syntax coloring that something is wrong with the first HTML entity, and indeed there is: the semicolon at the end of that entity is missing. Still, Firefox somehow figures it out and renders the right character.
Now, if I try to work with that in lxml:
#!/usr/bin/env python3
import lxml.html as LH
import lxml.html.clean as LHclean
testhtmlstring = "<i>Kristj&aacuten V&iacute;ctor</i>"
myhtml = LH.fromstring( testhtmlstring )
myhtml = LHclean.clean_html( myhtml )
myitem = myhtml.xpath("//i")[0]
myitemstr = myitem.text_content()
print(myitemstr)
... the code prints this in the terminal (Ubuntu 18.04):
Kristj&aacuten Víctor
... so, obviously, the broken HTML entity did not get converted to the right character.
Is there something I can use to get the right character in my output string from lxml, even for a broken HTML entity (as Firefox does)?
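One possible workaround, as a minimal sketch only: Python's standard-library html.unescape() follows the HTML5 rules for character references, which accept certain legacy named references even when the trailing semicolon is missing. Assuming the broken entity survives parsing as literal text (as in the output above), it could be decoded after the fact. The helper name decode_leftover_entities is hypothetical, and the input string below is a stand-in for what text_content() returned. One caveat: text that was deliberately escaped in the page (for example an entity name written out with an escaped ampersand) would be decoded a second time, so this is only reasonable when such text is not expected.

```python
import html


def decode_leftover_entities(text: str) -> str:
    """Decode entity references that are still present as literal text.

    Hypothetical helper: html.unescape() applies the HTML5 rules, which
    also resolve some legacy named references without a semicolon.
    """
    return html.unescape(text)


# Stand-in for the string returned by myitem.text_content() above.
leftover = "Kristj&aacuten Víctor"

fixed = decode_leftover_entities(leftover)
print(fixed)
```

In the script above, this would amount to passing myitem.text_content() through the helper before printing it.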