I'm parsing HTML by inheriting HTMLParser, which is a class coming from the library html.parser. I'm making a web scraper. I have set "convert_charrefs" to true. The program downloads a page by doing "downloadPage(url)" and passes it to myParser (I think It will be better for you if I don't paste here all my code). When the parser finds the link I'm interested to (e.g Attività e procedimenti) from a web site, the program get the value of the attribute "href" and tries to download the page linked by href, by doing "downloadPage(href)", passes it to myParser and so on... The code for downloadPage(href) is the following:
def getCharset(response):
str = response.info()["Content-type"]
if str:
end = re.search("charset=", str).span()[1]
if end:
return str[end:]
else:
return "ascii"
else:
return "ascii"
def downloadPage(url):
response = urllib.request.urlopen(url)
charset = getCharset(response)
return response.read().decode(charset)
Now, the problem is that certain link has some vowel stressed, such as "http://città.it/" (last url is faked). Not all links found in a web page are made of Unicode characters. So the following code sometimes raises UnicodeEncodeError:
urllib.request.urlopen(url)
I specify that I can't know at first glance how each link is composed