from urllib.request import urlopen
html = urlopen("http://www.google.com/").read().decode('utf-8').replace("preview","")
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Print only text nodes that contain at least one letter
        if any(c.isalpha() for c in data):
            print(data)
MyHTMLParser().feed(html)
input()
I am trying to make a program that looks at a website, saves the data, and then displays the main text of the HTML. It works perfectly with Google, and also perfectly in IDLE, but when I run it in cmd, any other site containing Unicode characters such as \u2605 (black star) or \u00A9 (copyright) causes an error. The error immediately closes the cmd window. The traceback is:
"UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 8: character maps to <undefined>"
I could add a lot of .replace() calls to cover most of the characters on the website, but I'm sure there is a simple way of just converting the text so that cmd can display it, or of just replacing those characters with "".
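
To make the kind of conversion I mean concrete, here is a minimal sketch. It assumes the problem is that cmd's code page cannot represent some characters, and it round-trips the text through the console encoding so that those characters are substituted or dropped. The helper name console_safe and the sample string are made up for illustration and are not part of my program.

```python
import sys

def console_safe(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding cannot represent
    replaced (errors="replace" gives "?") or dropped (errors="ignore").

    console_safe is a hypothetical helper, not a standard library function.
    """
    # Fall back to UTF-8 if stdout does not report an encoding
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # Encode with a lenient error handler, then decode back to str
    return text.encode(encoding, errors=errors).decode(encoding)

# Sample text containing a black star, a copyright sign and a trademark sign
sample = "Rated \u2605 \u00a9 Example\u2122"
print(console_safe(sample))                    # unencodable characters become "?"
print(console_safe(sample, errors="ignore"))   # unencodable characters are dropped
```

In the parser, this would mean calling print(console_safe(data)) instead of print(data). Which characters actually get replaced depends on the code page that cmd is using.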