BeautifulSoup error 'charmap' codec can't encode character

Question

This is the code I currently have

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'}
r = requests.get("http://www.google.com", headers=headers)
page_text = r.text
soup = BeautifulSoup(page_text, 'html.parser')
print(soup.prettify())

In theory it should send a request to Google, get the text back, and pretty-print it with BeautifulSoup's prettify() method.

Here's their example code (from http://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

Every time I run this code I get the 'charmap' codec error from the title.

FOUND A SOLUTION

The solution is to replace print() with this print function from a Stack Exchange member:

import sys

def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
    enc = file.encoding
    if enc == 'UTF-8':
        # A UTF-8 stream needs no escaping, so print normally.
        print(*objects, sep=sep, end=end, file=file)
    else:
        # Replace characters the stream cannot encode with backslash escapes.
        f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
        print(*map(f, objects), sep=sep, end=end, file=file)
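
To see what uprint() changes, here is a small sketch that imitates a cp1252-only console with an in-memory stream. The helper make_console() and the sample text are made up for illustration, and cp1252 is only an assumption about what a Windows console might use; the real encoding depends on the machine.

```python
import io
import sys


def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
    enc = file.encoding
    if enc == 'UTF-8':
        print(*objects, sep=sep, end=end, file=file)
    else:
        f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
        print(*map(f, objects), sep=sep, end=end, file=file)


def make_console():
    """Hypothetical helper: return (raw_bytes, text_stream) imitating a cp1252 console."""
    raw = io.BytesIO()
    return raw, io.TextIOWrapper(raw, encoding='cp1252', newline='\n')


text = 'price \u2192 5'  # U+2192 (right arrow) has no mapping in cp1252

# Plain print() raises UnicodeEncodeError on this stream, as in the question.
_, console = make_console()
try:
    print(text, file=console)
    console.flush()
    plain_print_failed = False
except UnicodeEncodeError:
    plain_print_failed = True

# uprint() writes a backslash escape in place of the unencodable character.
raw, console = make_console()
uprint(text, file=console)
console.flush()
escaped_output = raw.getvalue()
```

The unencodable character is not displayed; it is only replaced by an escape such as \u2192, so the output stays readable but is not identical to the page text.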

The problem is your shell encoding, cmd is basically crap. If I were you I would save yourself a lot of headaches and install cygwin https://www.cygwin.com/ or use a decent ide — Padraic Cunningham, Mar 26 '16 at 01:04
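
If the shell encoding is the limiting factor, one way to sidestep the console entirely is to write the text to a file with an explicit encoding. This is only a sketch: save_text(), the file name, and the sample string (standing in for soup.prettify()) are assumptions for illustration.

```python
import os
import tempfile


def save_text(text, path):
    """Write text to path as UTF-8, independent of the console encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Stand-in for soup.prettify(); contains characters many consoles cannot show.
html = "<p>caf\u00e9 \u2192 menu</p>"

out_path = os.path.join(tempfile.mkdtemp(), "prettified_demo.html")
save_text(html, out_path)

# Reading the file back with the same encoding returns the original text.
with open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

The file can then be opened in an editor or browser that understands UTF-8.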

n1c9 · Answer 1 · 2016-03-26T00:48:52.743

This happens when your terminal/PowerShell can't print a character it receives from BeautifulSoup. There are two ways to fix it, the first better than the second:

1. As described in PEP 0263, you can declare the encoding Python should use by putting # coding=<encoding name> or # -*- coding: <encoding name> -*- where you would put the shebang line.
2. Not the recommended method: at the beginning of your Python script,
```
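# Python 2 only: reload() is not a builtin in Python 3 and
# sys.setdefaultencoding() does not exist there.
import sys
reload(sys)
sys.setdefaultencoding('utf8') # or whichever one you want to use.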
```
This is not the recommended method because it's really a misuse of the sys module, but it works in a pinch if your program isn't terribly complex.
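
For Python 3, a rough counterpart can be sketched as follows. It first shows that the failure comes from encoding text for the console rather than from BeautifulSoup, then asks stdout to escape unencodable characters instead of raising. cp1252 is an assumed console encoding for the demonstration, and reconfigure() is only available on Python 3.7 and later, hence the hasattr() guard.

```python
import sys

text = "arrow: \u2192"  # a character that cp1252 has no mapping for

# Encoding for a cp1252 console fails on its own, without BeautifulSoup.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# The same text encodes to UTF-8 without trouble.
utf8_bytes = text.encode("utf-8")

# Python 3.7+: escape unencodable characters on stdout instead of raising.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(errors="backslashreplace")

print("console encoding:", sys.stdout.encoding)
print(text)
```

Setting the PYTHONIOENCODING environment variable before starting Python is another way to influence the encoding of the standard streams without touching the script.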

edited Mar 26 '16 at 00:48

answered Mar 26 '16 at 00:31

well, what error is it giving you when you use either method? – n1c9 Mar 26 '16 at 00:35
First I added #coding=utf8 to the top of my program, same error. http://i.imgur.com/vppHB5A.png – Keatinge Mar 26 '16 at 00:37
try the second method – n1c9 Mar 26 '16 at 00:38
Then I tried the sys part. I get this problem: "NameError: name 'reload' is not defined". http://i.imgur.com/svNj5lc.png – Keatinge Mar 26 '16 at 00:39
after a quick check that error comes up in python3. try `from imp import reload` – n1c9 Mar 26 '16 at 00:45
Okay, so now I'm getting a different error: "module sys has no attribute 'setdefaultencoding'". http://i.imgur.com/iDuuqr7.png – Keatinge Mar 26 '16 at 00:49
Maybe I'm doing this in an overly confusing way. All I want to do is download a website and parse it with BeautifulSoup. Clearly I must be doing something majorly wrong, because I'm sure this is very common. – Keatinge Mar 26 '16 at 00:50
I'd love for you to tell me how PEP 0263 is "completely incorrect.", @PadraicCunningham – n1c9 Mar 26 '16 at 02:04
