
I'm coding a wrapper with Python 3. While testing it, I found a small problem with an HTML page encoded as UTF-8:

Traceback (most recent call last):
File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 121, 
in <module>
    xpaths_extraction()
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 117, in xpaths_extraction
    xpaths = get_xpaths(html_file)
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 21, in get_xpaths
    content = clean.clean_html(htmlFile)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 8, in clean_html
    return clean_parsed_html(parsed_html)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 24, in clean_parsed_html
    refactored_url = cleaner.clean_html(parsed_html)
  File "src/lxml/html/clean.py", line 520, in lxml.html.clean.Cleaner.clean_html
  File "src/lxml/html/clean.py", line 396, in lxml.html.clean.Cleaner.__call__
  File "/home/caiocesare/PycharmProjects/script/venv/lib/python3.6/site-packages/lxml/html/__init__.py", line 364, in drop_tag
    if self.text and isinstance(self.tag, basestring):
  File "src/lxml/etree.pyx", line 1014, in lxml.etree._Element.text.__get__
  File "src/lxml/apihelpers.pxi", line 670, in lxml.etree._collectText
  File "src/lxml/apihelpers.pxi", line 1405, in lxml.etree.funicode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 98: invalid start byte

The funny thing is that the page claims to be UTF-8. All my pages are UTF-8: for 80,000 pages there was no problem, and then on page 80,001 I get this decode error. Why do I get this error, and is there a way to solve it?

import lxml.html
from lxml.html.clean import Cleaner


def clean_html(url):
    parsed_html = lxml.html.parse(url)
    return clean_parsed_html(parsed_html)


def clean_parsed_html(parsed_html):
    if parsed_html.getroot() is None:
        return ""

    cleaner = Cleaner()
    cleaner.javascript = True
    cleaner.style = True
    cleaner.kill_tags = ['head', 'script', 'header', 'href', 'footer', 'a']
    refactored_url = cleaner.clean_html(parsed_html)
    return lxml.html.tostring(refactored_url)
    We need to see some code if you want help debugging code. You can't go to a vet and expect it to cure your dog without _bringing_ your dog to the vet, can you? – Aran-Fey Apr 23 '18 at 06:57
  • Probably that one page only claims to be encoded with UTF-8, but is actually using a different codec. While this would mean that the *input* is broken, not your code, you're still going to be the one who has to deal with it. So in order to be helped, you need to show your code, as pointed out by Aran-Fey. – lenz Apr 23 '18 at 09:37
  • I posted the code, sorry for the inconvenience; I was sure I had posted it along with the error. – claudio gugliotta Apr 23 '18 at 15:22
  • You should ignore encoding errors: do not trust webpage encodings. Sometimes careless programmers mix encodings (this is especially true on dynamic pages, where different parts are created by different tools/databases). Or they just copy-paste (possibly also from users, with no encoding validation on the server side). – Giacomo Catenazzi Apr 24 '18 at 07:47
  • @GiacomoCatenazzi Your suggestion isn't really applicable to the described problem, where it's lxml's HTML parser that has problems parsing the document, not a text file decoded by the OP's code. Or can you show how to ignore encoding errors in this context? – lenz Apr 24 '18 at 09:48
  • @lenz: Not really (for that reason I left just a comment, not a full answer). BTW, the question's tags were just python and utf-8, so I didn't expect we needed a solution specific to lxml (in fact the parser is used for just a small task). – Giacomo Catenazzi Apr 24 '18 at 09:58
  • The problem is that I pass in an .html file downloaded by another program; in this task I have to clean one HTML file at a time and then return it. So @GiacomoCatenazzi, is the best way to ignore them? With a try or something similar? – claudio gugliotta Apr 24 '18 at 10:18
  • Python decoders often have an `errors='ignore'` flag. This question says you can do something similar with lxml: https://stackoverflow.com/questions/44352989/python-lxml-ignore-xml-declaration-errors (check whether `recover=True` works for your case). – Giacomo Catenazzi Apr 24 '18 at 12:31
  • @GiacomoCatenazzi The `recover` flag is `True` by default for HTML. I think it has nothing to do with encoding; it's about broken HTML syntax-wise. – lenz Apr 24 '18 at 19:37
  • @lenz: but the error tells us that the file contains a `0xb1` byte in a position invalid for UTF-8. Possibly that character is in an unexpected place (e.g. in a tag name), and `recover` was not about lax HTML syntax (from an XML point of view). OTOH, 1 in 80,000 is probably an exception (so I suspect a subtle bug). @claudiogugliotta: could you retrieve the problematic page (and maybe reduce it)? I think it's worth investigating further (and probably filing a bug report). – Giacomo Catenazzi Apr 25 '18 at 06:19
  • @GiacomoCatenazzi Sorry for being unclear. The OP's problem is indeed related to encoding. But the `recover` option apparently doesn't help here. The docs say: *"recover - try hard to parse through broken HTML (default: True)"* (cf. `help(lxml.etree.HTMLParser)`). – lenz Apr 25 '18 at 08:27

1 Answer


You can try the following to override the encoding of a document you are parsing:

parsed_html = lxml.html.parse(url, parser=lxml.html.HTMLParser(encoding=CODEC))

For the placeholder CODEC, you specify the actual encoding of the document, e.g. "Latin-1".
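For instance, plugged into the clean_html function from the question (clean_html_forced is a hypothetical name, and the codec is an assumption you have to verify per document):

import lxml.html

def clean_html_forced(url, codec="Latin-1"):
    # Override the document's (apparently wrong) encoding declaration
    # with an explicitly chosen codec.
    parser = lxml.html.HTMLParser(encoding=codec)
    parsed_html = lxml.html.parse(url, parser=parser)
    return clean_parsed_html(parsed_html)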

This solution has two obvious drawbacks:

  • You need to find out what encoding the document actually uses. This means some trial and error. For example, open the document in your browser (preferably in source-code view) and use the browser's menu to change the encoding (in Firefox, this is under View). If there are more documents like that, you'll have to repeat that process.
  • You need to treat the problematic document(s) specially, as some kind of fallback. It's probably best to use a try/except construct, where you first try to parse them according to the encoding declaration (see the sketch after this list).
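A minimal sketch of that fallback, assuming the wrapper functions from the question and "Latin-1" as a last-resort codec (Latin-1 can decode any byte sequence, so the second attempt never raises, at the cost of mojibake if the true encoding is something else):

import lxml.html

def clean_html_robust(url, fallback_encoding="Latin-1"):
    # First attempt: trust the document's declared encoding. Per the
    # traceback, the UnicodeDecodeError surfaces during cleaning, so
    # the whole parse-and-clean call goes inside the try block.
    try:
        return clean_parsed_html(lxml.html.parse(url))
    except UnicodeDecodeError:
        # Second attempt: re-parse with an explicit fallback codec.
        parser = lxml.html.HTMLParser(encoding=fallback_encoding)
        return clean_parsed_html(lxml.html.parse(url, parser=parser))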

So this is not a fully automatic solution. Since the input is broken, there's not much you can do – there are no magic tools that can recover arbitrary mistakes made by other software; at best there are decent heuristics. If you have a lot of broken inputs and you don't particularly care about them, just skip them, so you can process the sane portion.
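If skipping is the route you take, a hypothetical batch loop (get_xpaths is the function from the question's traceback; html_files is an assumed name for the list of pages):

for html_file in html_files:
    try:
        xpaths = get_xpaths(html_file)
    except UnicodeDecodeError:
        # Broken input: note it and move on to the sane pages.
        print("skipping undecodable page:", html_file)
        continue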

lenz