Python website scraper UnicodeEncodeError

Question

I'm using Requests and BeautifulSoup with Python 3.4 to scrape information off a website that may or may not contain Japanese or other special characters.

import requests
from bs4 import BeautifulSoup

def startThisPage(url):
    r = requests.get(str(url))
    r.encoding = "utf8"
    print(r.content.decode('utf8'))
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.h2.string)

The h2 contains this: "Fate/kaleid liner Prisma ☆ Ilya Zwei!" and I'm pretty sure the star is what is giving me trouble right now.

The error being thrown:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 25: character maps to <undefined>

The page is encoded as UTF-8, so I tried encoding and decoding the byte string I get from r.content as UTF-8. I also tried decoding with unicode_escape first, thinking the problem was a doubled backslash, but that wasn't the case. Any ideas?

Are you on Windows? Printing UTF-8 to Windows consoles is notoriously unreliable. - OdraEncoded, Aug 24 '15 at 22:28
I am running Windows 7 64-bit. How would I get around it, since I don't have Ubuntu installed? @OdraEncoded - MooingRawr, Aug 24 '15 at 23:22
You could write it to a file instead of printing, or remove the non-ASCII characters. You could also make a GUI to show the output if you need it in real time. Honestly, I wouldn't bother trying to get the Windows console to display the characters correctly; maybe PowerShell (the new C#-based command prompt) can print them. - OdraEncoded, Aug 24 '15 at 23:43
Unrelated: you could use `BeautifulSoup(requests.get(url).text)` to pass Unicode, or even `BeautifulSoup(urllib.request.urlopen(url))` to pass the bytes as is (assuming `urlopen()` works for the URL). - jfs, Aug 25 '15 at 07:42

score 2 · Accepted Answer · edited May 23 '17 at 12:09

soup.h2.string is a Unicode string. The console's character encoding (cp437, for example) can't represent some Unicode characters, such as ☆ (U+2606 WHITE STAR), and that leads to the error. To work around it, see my answer to the "Python, Unicode, and the Windows console" question.
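
As a rough illustration of the point above (not the approach from the linked answer), the failure can be reproduced without a Windows console by encoding the title to cp437 directly. The helper name `to_console_safe` is hypothetical; it simply escapes whatever the target encoding can't represent, so that printing never raises:

```python
import sys

def to_console_safe(text, encoding=None):
    """Escape characters that the given encoding cannot represent.

    Hypothetical helper: falls back to the stdout encoding, then ASCII.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

title = "Fate/kaleid liner Prisma \u2606 Ilya Zwei!"  # stand-in for soup.h2.string

# cp437 has no mapping for U+2606, so a strict encode fails just like print() did.
try:
    title.encode("cp437")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# The escaped version contains only characters cp437 can represent.
print(to_console_safe(title, "cp437"))
```

This loses the star in the displayed output, so it is only suitable when a readable approximation on the console is good enough.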

"I still get the same error trying to write to a file."

Text files created with open() use locale.getpreferredencoding(False) by default, which may be something like cp1252 on Windows. Pass an explicit character encoding that supports the full Unicode range instead:

import io

with io.open('title.txt', 'w', encoding='utf-8') as file:
    file.write(soup.h2.string)
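
For comparison, here is a small self-contained sketch of the difference. It assumes cp1252 as the locale default (yours may differ) and uses a hard-coded title and a temporary directory in place of soup.h2.string and a real output path:

```python
import os
import tempfile

title = "Fate/kaleid liner Prisma \u2606 Ilya Zwei!"  # stand-in for soup.h2.string
path = os.path.join(tempfile.mkdtemp(), "title.txt")

# Writing with cp1252 (assumed here as the locale default) hits the same
# 'charmap' error, because cp1252 has no mapping for U+2606.
try:
    with open(path, "w", encoding="cp1252") as f:
        f.write(title)
    failed = False
except UnicodeEncodeError:
    failed = True

# With an explicit UTF-8 encoding the write succeeds.
with open(path, "w", encoding="utf-8") as f:
    f.write(title)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print("cp1252 write failed:", failed)
print("round trip intact:", round_tripped == title)
```

Whatever reads the file later has to open it as UTF-8 too, otherwise the star will come back as mojibake.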
