Unicode parsing error

Question

from urllib.request import urlopen
html = urlopen("http://www.google.com/").read().decode('utf-8').replace("preview","")
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if any(c.isalpha() for c in data):
            print(data)
MyHTMLParser().feed(html)
input()

So I am trying to make a program that looks at a website and saves the data, then displays the main data of the HTML. This works perfectly with Google, and also perfectly in IDLE, but in cmd any other site with Unicode characters like \u2605 (black star) or \u00A9 (copyright) will create an error. This error immediately closes the cmd window. The traceback is:

"UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 8: character maps to <undefined>"

I could chain a lot of .replace() calls for most of these characters on the website, but I'm sure there is a simple way of just converting the text so the console can display it, or just replacing those characters with "".

score 0 · Answer 1 · edited May 23 '17 at 12:30

After looking at: UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function

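Following advice #2, the solution seems to involve importing sys and encoding your string with sys.stdout.encoding and errors='replace' (or errors='ignore' to drop the characters entirely). Note that read() returns bytes, so it has to be decoded before it can be encoded for the console:

import sys
html = urlopen("http://www.google.com/").read().decode('utf-8').encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding)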

You might have to adjust the decoding step. I'm not super sure, since I haven't set this problem up on my machine.
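As a minimal sketch of the approach described above (not tested against the asker's exact setup): encode the text with the console's encoding using errors='replace', then decode it again with that same encoding so the result is a printable str. The helper name console_safe and its parameters are hypothetical, chosen only for illustration.

```python
import sys


def console_safe(text, encoding=None, errors="replace"):
    """Return a copy of text that the given (or current console) encoding can represent.

    console_safe is a hypothetical helper name, not part of any library.
    """
    # Assumption: fall back to UTF-8 when stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # Encode with the target encoding, substituting '?' (errors="replace")
    # or dropping the character (errors="ignore"), then decode with the
    # SAME encoding so the bytes round-trip cleanly.
    return text.encode(encoding, errors=errors).decode(encoding)


# Characters the console cannot represent are replaced instead of raising.
print(console_safe("Python\u2122 \u2605 \u00a9 2014"))
```

In the parser from the question, this would mean calling print(console_safe(data)) inside handle_data. Decoding with the same encoding that was used for encoding matters: decoding console-encoded bytes as UTF-8 can itself fail.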


answered Jun 14 '14 at 07:38

ForgetfulFellow


I changed it to html = urlopen("http://www.google.com/").read().decode('utf-8').encode(sys.stdout.encoding, errors='replace').decode('utf-8') and got a new error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 5311: invalid start byte – user3739743 Jun 14 '14 at 08:37
I don't think you can add 'errors' to the urlopen method – ForgetfulFellow Jun 16 '14 at 23:07
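The 'errors' argument belongs to bytes.decode() and str.encode() rather than to urlopen. If the "invalid start byte" error comes from decoding the downloaded page (that is, the page is not actually UTF-8), one possible sketch is to use the charset the server declares and decode with errors='replace'. The names decode_body and fetch_text are hypothetical, and the UTF-8 fallback is an assumption.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Decode downloaded bytes without raising on invalid sequences."""
    # Assumption: use UTF-8 when the server declares no charset.
    # errors="replace" turns undecodable bytes (such as 0xff in UTF-8)
    # into U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url):
    """Download url and return its body as str (hypothetical helper)."""
    with urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None if the header does not specify one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

A call such as html = fetch_text("http://www.google.com/") would then replace the urlopen(...).read().decode('utf-8') chain from the question.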
