
I want to read an HTML document into my Python program.

I get this error:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xbb in position 362: illegal multibyte sequence

when using this code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('xxx.html'))
print(soup)

What am I doing wrong?

croxy
Revol
  • Possible duplicate of [UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c](https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c) – Max Sep 23 '17 at 05:26

1 Answer


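You are facing an encoding/decoding problem: open() decodes the file with your system's default encoding (cp950 here), which does not match the file's actual encoding.
Try this: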

soup = BeautifulSoup(open('xxx.html', encoding='your xxx.html file encoding'))

You can find the encoding of xxx.html by searching for 'charset' in the file.
You will find something like charset=utf-8 or charset=xxx.
The value after the '=' (here 'utf-8' or 'xxx') is the encoding to pass to open().
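
If you would rather not look the charset up by hand, the search can be automated. The following is only a rough sketch using the standard library: sniff_charset and read_html are hypothetical helper names (they are not part of BeautifulSoup), and it assumes the charset declaration appears within the first few kilobytes of the file and names a codec that Python knows.

```python
import re

# Hypothetical helpers for illustration; not part of BeautifulSoup.
# Matches declarations such as charset=utf-8 or charset="big5".
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_charset(path, default="utf-8", probe_bytes=4096):
    """Return the charset declared near the top of the file, or `default`."""
    # Binary mode does no decoding, so this read cannot raise UnicodeDecodeError.
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    match = CHARSET_RE.search(head)
    return match.group(1).decode("ascii").lower() if match else default


def read_html(path):
    """Read the file as text, decoding with the declared (or default) charset."""
    with open(path, encoding=sniff_charset(path)) as f:
        return f.read()
```

You could then pass the result to BeautifulSoup, for example BeautifulSoup(read_html('xxx.html'), 'html.parser'). If the file declares no charset, or declares the wrong one, decoding can still fail and you will have to supply the encoding yourself.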

Hohenheim