'utf-8' codec can't decode byte 0xf6 in position 139604: invalid start byte

Question

I am making a knowledge engineering project.

While I was crawling a scientist's personal site, this error occurred.

import html2text
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import urllib


homepage = "http://angom.myweb.cs.uwindsor.ca"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=homepage, headers=headers)
print(req)
c = urlopen(req).read()
print(type(c))

content = urlopen(req).read().decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 139604: invalid start byte

`print(c[139600:139610])` would give a hint maybe? – Grimmy, Jul 18 '17 at 04:15

Grimmy · Answer 1 · 2017-07-18T04:28:39.047


The page declares its encoding in a meta tag in its HTML head:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

So use that encoding when decoding the response bytes:

content = urlopen(req).read().decode("windows-1252")

will work in this instance.
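To illustrate why the original call fails: byte 0xf6 is never a valid first byte of a UTF-8 sequence, while in windows-1252 it maps to "ö". The following is a minimal offline sketch. The sample bytes and the helper name `decode_with_fallback` are hypothetical stand-ins, not taken from the page in question.

```python
# Hypothetical sample standing in for the bytes near position 139604.
raw = b"Sch\xf6n"


def decode_with_fallback(data, declared="windows-1252"):
    """Try UTF-8 first; on failure, use the encoding the page declares."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 0xf6 cannot start a UTF-8 sequence, so this branch runs here.
        return data.decode(declared)


text = decode_with_fallback(raw)
print(text)
```

Decoding directly with the declared charset, as shown above, is simpler when you already know it; the fallback only helps when crawling pages with mixed encodings.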

If you are planning to use BeautifulSoup, pass it the raw bytes rather than a decoded string; it already does a really good job of figuring out the encoding.
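As a rough sketch of that approach, assuming the `bs4` package is installed: the inline HTML below is a hypothetical stand-in for the downloaded page, with the same meta tag and one windows-1252 byte.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by urlopen(req).read().
html_bytes = (
    b"<html><head>"
    b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
    b"</head><body><p>Sch\xf6n</p></body></html>"
)

# Pass bytes, not str, so BeautifulSoup can pick the encoding itself.
soup = BeautifulSoup(html_bytes, "html.parser")
paragraph = soup.p.get_text()

print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(paragraph)
```

With the real page, `html_bytes` would be replaced by `urlopen(req).read()` without the `.decode(...)` call.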

edited Jul 18 '17 at 04:28

answered Jul 18 '17 at 04:20

Grimmy

