How to filter non-UTF-8 HTML to get UTF-8 HTML?

Question

http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70

If I use the following Python code to parse the above HTML page, I get a UnicodeDecodeError.

import sys
from lxml import html
doc = html.parse(sys.stdin, parser=html.HTMLParser(encoding='utf-8'))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte

If I filter the input with iconv -f utf-8 -t utf-8 -c first and then run the same Python code, I still get a UnicodeDecodeError. What is a robust filter (one that does not require knowing the encoding of the input HTML) whose output always works with this Python code? Thanks.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte

EDIT: Here are the commands used.

$ wget 'http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70'
$ ./main.py < 'view.html?doi=10.15430%2FJCP.2018.23.2.70'
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
$ iconv -f utf-8 -t utf-8 -c < 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | ./main.py
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte

I have no problem reading it directly from the web page with `r = requests.get(url)` and `doc = html.fromstring(r.content)`. What do you have in `sys.stdin`? `parse()` needs a `filename` or `url`; if you pass a filename or URL, the problem may be in that string, not in the HTML from the server. – furas, May 27 '20 at 23:20
I downloaded the file with wget, then read the downloaded file from stdin in Python using the two lines of code shown. – user1424739, May 27 '20 at 23:22
It seems the file is in `latin1`, not `utf8`. Use `cat filename.html | iconv -f latin1 -t utf-8 | myscript`. See my answer. – furas, May 28 '20 at 01:07
I can't afford to determine each input file's encoding. I need a solution that doesn't require knowing the specific encoding. – user1424739, May 28 '20 at 01:12
Then run it with different encodings and catch the errors: when one encoding raises an error, parse the file again with a different one. Pages usually use one of three encodings, `utf-8`, `latin1` or `cp1250` (possibly with variants for native characters, like `latin2` for Central Europe). There is also the Python module `chardet` (character detection), which `requests` uses to recognize encodings. So you can read the file with `open()` and `read()` in bytes mode, detect the encoding, and convert the bytes to a string. – furas, May 28 '20 at 01:19
See the example in my answer, which uses `try/except` to test different encodings. – furas, May 28 '20 at 01:40

furas · Answer 1 · 2020-05-28T13:49:29.117

After some digging I found that this file is not in utf-8 but in latin1, and that the problem is sys.stdin, which uses utf-8. Instead of reading from sys.stdin as it is, you can wrap sys.stdin.buffer in a new text stream that uses the encoding you need.

main-latin1.py

import sys
import io
from lxml import html

#input_stream = sys.stdin # gives error 
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')
doc = html.parse(input_stream)

print(html.tostring(doc))

And now you can run

cat  'view.html?doi=10.15430%2FJCP.2018.23.2.70' | python main-latin1.py
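
As a hedged illustration of why the wrapper helps, the sketch below uses an in-memory io.BytesIO as a stand-in for sys.stdin.buffer; the sample bytes and the wrap helper are made up for the demo. Note that latin1 maps every possible byte to a character, so it never raises an error. Decoding successfully with latin1 therefore does not prove that the file was really written in latin1.

```python
import io

# Made-up sample: 0xe9 and 0xb0 are fine in latin1 but invalid here as UTF-8.
raw = b"<p>caf\xe9 \xb0</p>"

def wrap(binary_stream, encoding):
    """Wrap a binary stream in a text stream, as done with sys.stdin.buffer."""
    return io.TextIOWrapper(binary_stream, encoding=encoding)

# Reading through a UTF-8 wrapper fails, like the original sys.stdin did.
try:
    wrap(io.BytesIO(raw), "utf-8").read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading through a latin1 wrapper always succeeds, whatever the bytes are.
latin1_text = wrap(io.BytesIO(raw), "latin1").read()
```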

EDIT: You can also convert it in console with iconv -f latin1 -t utf-8

cat  'view.html?doi=10.15430%2FJCP.2018.23.2.70' | iconv -f latin1 -t utf-8 | python main-utf8.py

main-utf8.py

import sys
from lxml import html

doc = html.parse(sys.stdin)

print(html.tostring(doc))

BTW: There is no problem reading it directly from the page using requests

import requests
from lxml import html

r = requests.get('http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70')

doc = html.fromstring(r.text)

print(html.tostring(doc))

EDIT: You can read the data as bytes and use a for-loop with try/except to decode it with different encodings.

Run it with the file name as an argument, without <

myscript filename.html

import sys
from lxml import html

# --- function ---

def decode(data, encoding):
    """Return data decoded with encoding, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# --- main ---

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    sys.exit(1)

# read the raw bytes; decoding happens below
with open(sys.argv[1], 'rb') as f:
    data = f.read()

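# latin1 maps every byte to a character, so it never fails;
# any encoding listed after it (here cp1250) is never reached
for encoding in ('utf-8', 'latin1', 'cp1250'):
    result = decode(data, encoding)
    if result is not None: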
        print('encoding:', encoding)
        doc = html.fromstring(result)
        #print(html.tostring(doc))
        break
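
The same idea can be packaged as a reusable helper. This is a minimal sketch rather than the code above: decode_with_fallback is a hypothetical name, and trying cp1252 before a latin1 catch-all is an assumption, chosen because latin1 accepts any input and therefore has to come last.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252"), last_resort="latin1"):
    """Try each strict candidate in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin1 can decode any byte sequence, so this line never raises.
    return data.decode(last_resort), last_resort

# Example with made-up input: 0xe9 alone is not valid UTF-8 but is valid cp1252.
text, used = decode_with_fallback(b"caf\xe9")
```

The returned encoding name only says which codec accepted the bytes, not which encoding the author actually used, so the decoded text may still contain wrong characters.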

EDIT: I tried the chardet module (character detection), which requests uses, but it gives me windows-1252 (cp1252) instead of latin1. Yet for some reason requests has no problem getting it right.

import sys
from lxml import html
import chardet

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    sys.exit(1)

# read the raw bytes so chardet can inspect them
with open(sys.argv[1], 'rb') as f:
    data = f.read()

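# chardet reports None when it cannot make a guess;
# falling back to latin1 in that case is an assumption, not part of chardet
encoding = chardet.detect(data)['encoding'] or 'latin1'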
print('encoding:', encoding)

doc = html.fromstring(data.decode(encoding))
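
Another option, not used in the code above, is to look at the charset the page declares about itself before falling back to statistical detection. This is a rough sketch, assuming the declaration appears in a meta tag within the first few kilobytes. sniff_meta_charset is a hypothetical helper, the sample input is made up, and the regular expression is a simplification of what real HTML parsers do.

```python
import codecs
import re

# Matches both <meta charset="..."> and the http-equiv "charset=..." form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(data, limit=4096):
    """Return the charset declared in a <meta> tag of raw HTML bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    name = match.group(1).decode("ascii").lower()
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return None
    return name

# Made-up sample document, not the page from the question.
sample = b'<html><head><meta charset="euc-kr"></head><body></body></html>'
declared = sniff_meta_charset(sample)
```

If a name comes back, data.decode(declared, errors='replace') may be a reasonable first attempt; if not, the try/except loop or chardet can still be used.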

score 0 · Answer 2 · answered May 27 '20 at 22:52

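You could filter the input using str = unicode(str, errors='ignore'), as suggested in the question "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c". (That is Python 2 syntax; in Python 3 the equivalent is to call .decode('utf-8', errors='ignore') on the raw bytes.) This is not always desirable, since undecodable bytes are silently dropped, but it may be fine for your use case.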

Or it seems lxml can use encoding='unicode' in certain cases. Have you tried that?
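
In Python 3 the errors='ignore' idea could be sketched as a small filter over raw bytes. This is an illustration under assumptions, not part of the original answer: to_valid_utf8 and main are hypothetical names, and the script is assumed to sit in a pipeline in front of the parsing script. It guarantees valid UTF-8 output, but bytes that are not valid UTF-8 are replaced or dropped, so text in another encoding is damaged rather than converted.

```python
import sys

def to_valid_utf8(data, errors="replace"):
    """Return data as valid UTF-8 bytes, replacing or dropping invalid bytes."""
    return data.decode("utf-8", errors=errors).encode("utf-8")

def main():
    # Work on the binary streams so Python does no decoding of its own.
    sys.stdout.buffer.write(to_valid_utf8(sys.stdin.buffer.read()))

# Made-up input: 0xb0 is not a valid UTF-8 start byte.
cleaned = to_valid_utf8(b"ab\xb0cd", errors="ignore")
```

Calling main() at the end of a script file would turn this into a command-line filter; with the default errors='replace', each invalid byte becomes U+FFFD instead of disappearing.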

jdaz


I read the input from stdin and don't call unicode directly. Where should I call `unicode(str, errors='ignore')`? How should I modify the lxml code so that it is robust to any encoding? BTW, `encoding='unicode'` in lxml does not work (I don't use `tostring`). – user1424739 May 27 '20 at 22:57
