http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70
If I use the following Python code to parse the above HTML page, I get a UnicodeDecodeError.
import sys
from lxml import html
doc = html.parse(sys.stdin, parser=html.HTMLParser(encoding='utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
If I first filter the input with iconv -f utf-8 -t utf-8 -c and then run the same Python code, I still get a UnicodeDecodeError (the message follows this paragraph). What is a robust filter (one that does not need to know the encoding of the input HTML) whose output always works with this Python code? Thanks.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
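To make concrete what kind of filter I am asking about, here is a minimal sketch using only the Python standard library. The names sanitize_utf8 and filter_stream are hypothetical, made up for this illustration. It assumes that replacing undecodable bytes with U+FFFD is acceptable; it does not detect the real encoding, so if the page is actually in some other encoding, that text would be replaced rather than decoded correctly.

```python
import io
from typing import BinaryIO


def sanitize_utf8(raw: bytes) -> bytes:
    """Return bytes that decode cleanly as strict UTF-8 (hypothetical helper)."""
    # errors="replace" substitutes U+FFFD for every undecodable byte sequence,
    # so the decoded text contains only valid code points.
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def filter_stream(src: BinaryIO, dst: BinaryIO) -> None:
    """Read all bytes from src, sanitize them, and write them to dst."""
    dst.write(sanitize_utf8(src.read()))


# Small demonstration with the two kinds of bad bytes from the error messages:
# 0xb0 is not a valid start byte, and 0xed is not followed by a continuation byte.
sample = b"abc\xb0\xeddef"
out = io.BytesIO()
filter_stream(io.BytesIO(sample), out)
cleaned = out.getvalue()
cleaned.decode("utf-8")  # a strict decode no longer raises
```

As a command-line filter this would be called as filter_stream(sys.stdin.buffer, sys.stdout.buffer), working on raw bytes rather than on the text-mode sys.stdin. The traceback ends in codecs.py, which suggests (though I have not confirmed it) that the decoding is done by the text-mode sys.stdin before lxml sees the data.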
EDIT: Here are the commands used.
$ wget 'http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70'
$ ./main.py < 'view.html?doi=10.15430%2FJCP.2018.23.2.70'
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
$ iconv -f utf-8 -t utf-8 -c < 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | ./main.py
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte