I'm using Requests and BeautifulSoup with Python 3.4 to scrape information from a website that may contain Japanese or other special characters.

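import requests
from bs4 import BeautifulSoup

def startThisPage(url):
    r = requests.get(str(url))
    r.encoding = "utf8"
    # Decode the raw response bytes as UTF-8 and print the page source
    print(r.content.decode('utf8'))
    soup = BeautifulSoup(r.content, 'html.parser')
    # Print the text of the first <h2> element
    print(soup.h2.string)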

The h2 contains this: "Fate/kaleid liner Prisma ☆ Ilya Zwei!" and I'm pretty sure the star is what is causing the problem.

The error being raised:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 25: character maps to <undefined>

The page is encoded as UTF-8, so I tried encoding and decoding the byte string I receive from r.content with utf8. I also tried decoding with unicode_escape first, thinking the problem was a doubled backslash, but that wasn't the case. Any ideas?

MooingRawr
  • Are you on Windows? Printing UTF-8 to Windows consoles notoriously doesn't work. – OdraEncoded Aug 24 '15 at 22:28
  • I am running Windows 7 64-bit. How would I get around it, since I don't have Ubuntu installed? @OdraEncoded – MooingRawr Aug 24 '15 at 23:22
  • You could write it to a file instead of printing, or remove the non-ASCII characters. You could also make a GUI for showing the output if you need it in real time. Honestly, I wouldn't bother trying to get the Windows console to display the characters correctly; maybe PowerShell (the new C#-based command prompt) can print them. – OdraEncoded Aug 24 '15 at 23:43
  • I still get the same error trying to write to a file... – MooingRawr Aug 25 '15 at 00:07
  • Unrelated: you could use `BeautifulSoup(requests.get(url).text)` to pass Unicode, or even `BeautifulSoup(urllib.request.urlopen(url))` to pass bytes as is (assuming `urlopen()` works for the URL). – jfs Aug 25 '15 at 07:42

1 Answer

soup.h2.string is a Unicode string. The console's character encoding (such as cp437) can't represent some Unicode characters, here ☆ (U+2606 WHITE STAR), and that leads to the error. To work around it, see my answer to the "Python, Unicode, and the Windows console" question.
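
One possible workaround, shown here only as a minimal sketch and not necessarily the approach taken in the linked answer, is to escape whatever the console encoding can't represent before printing. The helper name console_safe is hypothetical, and cp437 is assumed as the console encoding purely for illustration:

```python
import sys

def console_safe(text, encoding=None):
    """Escape characters that the target encoding can't represent.

    Hypothetical helper: falls back to sys.stdout.encoding (or ASCII)
    when no encoding is given.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

title = "Fate/kaleid liner Prisma \u2606 Ilya Zwei!"

# Assumed console encoding: cp437, which has no mapping for U+2606.
print(console_safe(title, "cp437"))
```

This keeps the script running, at the cost of showing an escape sequence in place of the star; it does not make the console display the character itself.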

"I still get the same error trying to write to a file..."

Files created with open() use locale.getpreferredencoding(False), such as cp1252, by default. Pass an explicit character encoding that supports the full Unicode range instead:

import io

with io.open('title.txt', 'w', encoding='utf-8') as file:
    file.write(soup.h2.string)
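
To see why the explicit encoding matters, here is a small self-contained sketch. It assumes cp1252 as the locale's preferred encoding (as in the example above) and uses a hard-coded title string and a temporary directory in place of the scraped page:

```python
import os
import tempfile

title = "Fate/kaleid liner Prisma \u2606 Ilya Zwei!"

# cp1252 has no mapping for U+2606, so encoding with it fails in the
# same way as writing to a file opened with that encoding.
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every Unicode character, so the write succeeds
# and the text survives a round trip.
path = os.path.join(tempfile.mkdtemp(), "title.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(title)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == title)
```

The same encoding should be passed when reading the file back, otherwise the locale default is used again for decoding.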
jfs