Python - cannot decode html (urllib)

Question

I'm trying to write the HTML of a web page to a file, but I'm having trouble decoding the characters:

import urllib.request

response = urllib.request.urlopen("https://www.google.com")

charset = response.info().get_content_charset()
print(response.read().decode(charset))

The last line raises an error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)

response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I pass "utf-8" to decode(), I get a similar problem:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte

What's going on?

@theausome Not if I use the file.write() function, which expects a string. – Robin71, Jan 29 '18 at 17:19

Joe · Answer 1 · 2018-01-29T17:52:03.037

3

You can ignore invalid characters using

response.read().decode("utf-8", 'ignore')

Instead of 'ignore' there are other error handlers, e.g. 'replace'.

https://www.tutorialspoint.com/python/string_encode.htm

https://docs.python.org/3/howto/unicode.html#the-string-type

(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
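As a rough illustration of how these error handlers behave, here is a minimal sketch that needs no network access. The sample word is an assumption chosen for the demo, not taken from the page in the question; it contains the byte 0xb6, which is "ś" (U+015B) in ISO-8859-2, the same byte and character that appear in the two tracebacks.

```python
# Sample bytes (an assumption for this demo): the Polish word "cześć"
# encoded as ISO-8859-2, where 0xb6 is the byte for "ś" (U+015B).
raw = "cześć".encode("iso-8859-2")

# Decoding with the charset the data was really written in keeps every character.
correct = raw.decode("iso-8859-2")

# Decoding as UTF-8 with 'ignore' silently drops the bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for undecodable bytes.
replaced = raw.decode("utf-8", "replace")

# The same handlers exist for str.encode(); here 'replace' emits "?" instead.
ascii_safe = correct.encode("ascii", "replace")

# ascii() escapes non-ASCII characters, so these prints work even on an
# ASCII-only console.
print(ascii(correct))
print(ascii(ignored))
print(ascii(replaced))
print(ascii_safe)
```

Note that 'ignore' and 'replace' both lose information when the chosen codec does not match the data, so they are a workaround rather than a way to recover the original text.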

edited Jan 29 '18 at 17:52

answered Jan 29 '18 at 17:44


Is this fine to do when `print(resp.info().get_content_charset())` returns `None`? Wasn't sure if this was what OP was also seeing as they stored it in a variable. – Maxim Feb 03 '21 at 01:58
I admit it's not totally clean. This means that the system was not able to detect the encoding, probably because it was not explicitly stated in the headers. See https://stackoverflow.com/a/24372670/7919597 and https://stackoverflow.com/questions/4981977/how-to-handle-response-encoding-from-urllib-request-urlopen-to-avoid-typeerr and https://stackoverflow.com/questions/14592762/a-good-way-to-get-the-charset-encoding-of-an-http-response-in-python/14592894 for other approaches to get the charset. "In general the server may lie about the encoding or do not report it at all". – Joe Feb 03 '21 at 08:02
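One possible way to cope with a missing or wrong charset is sketched below. The helper name decode_body and its fallback parameter are hypothetical, introduced only for this illustration; the approach (try the declared charset, then a default, then decode leniently) is an assumption about what is acceptable for the use case, not the method from the linked answers.

```python
def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Hypothetical helper: decode bytes, tolerating a missing or wrong charset."""
    # Try the charset from the headers first (it may be None), then the fallback.
    candidates = [name for name in (declared_charset, fallback) if name]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes do not match this codec.
            # LookupError: the server reported a codec name Python does not know.
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode(fallback, errors="replace")


# Usage with urllib would look roughly like this (not executed here):
#   response = urllib.request.urlopen(url)
#   text = decode_body(response.read(), response.info().get_content_charset())
#   with open("page.html", "w", encoding="utf-8") as f:
#       f.write(text)
sample = "cześć".encode("iso-8859-2")
print(ascii(decode_body(sample, "iso-8859-2")))
print(ascii(decode_body(sample, None)))
```

Opening the output file with an explicit encoding, as in the usage comment, avoids depending on the platform default, which may be ASCII on some systems.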
