UnicodeEncodeError: "ascii" can't encode character '\xe0' while parsing HTML (Python)

Question

I'm writing a web scraper that parses HTML by subclassing HTMLParser from the html.parser library, with "convert_charrefs" set to True. The program downloads a page with "downloadPage(url)" and passes it to myParser (I think it's better not to paste all my code here). When the parser finds a link I'm interested in (e.g. Attività e procedimenti) on a website, the program reads the value of its "href" attribute, downloads the linked page with "downloadPage(href)", passes it to myParser, and so on. The code for downloadPage is the following:

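import re
import urllib.request

def getCharset(response):
    # Read the charset from the Content-Type header; fall back to ASCII if it is absent.
    contentType = response.info()["Content-Type"]
    if contentType:
        # re.search returns None when the header carries no charset parameter.
        match = re.search(r"charset=([^;\s]+)", contentType, re.IGNORECASE)
        if match:
            return match.group(1).strip("\"'")
    return "ascii"

def downloadPage(url):
    response = urllib.request.urlopen(url)
    charset = getCharset(response)
    return response.read().decode(charset)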

Now, the problem is that some links contain accented vowels, such as "http://città.it/" (this URL is made up). Not every link found on a page contains non-ASCII characters, so the following call only sometimes raises UnicodeEncodeError:

urllib.request.urlopen(url)

I should add that I can't know in advance which characters each link contains.

possible duplicate of [How to fetch a non-ascii url with Python urlopen?](http://stackoverflow.com/questions/4389572/how-to-fetch-a-non-ascii-url-with-python-urlopen) — Samba, Jun 24 '14 at 14:43
Unfortunately, I have found the same thread you're talking about much later — StackUser, Jun 25 '14 at 09:25

score 1 · Accepted Answer · answered Jun 25 '14 at 09:23

I solved this problem by percent-encoding every URL component that contains non-ASCII characters:

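import urllib.parse

def fromIriToUri(iri):
    # urlsplit yields (scheme, netloc, path, query, fragment).
    myUri = []
    for part in urllib.parse.urlsplit(iri):
        try:
            # Components that are already pure ASCII are kept unchanged.
            part.encode("ascii")
            myUri.append(part)
        except UnicodeEncodeError:
            # Percent-encode (as UTF-8) any component with non-ASCII characters.
            myUri.append(urllib.parse.quote(part))
    return urllib.parse.urlunsplit(myUri)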
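
One caveat: the function above percent-encodes every non-ASCII component the same way, including the host. Host names are normally converted with IDNA (Punycode) rather than percent-encoding, so a made-up host like "città.it" may still fail to resolve. Below is a minimal sketch of a variant that treats each component separately. The name iri_to_uri and the chosen "safe" character sets are my own assumptions, not part of the original solution, and user:password credentials in the URL are not handled.

```python
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri):
    """Hypothetical helper: turn an IRI into an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host: encode with Python's built-in "idna" codec instead of percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = "{}:{}".format(netloc, parts.port)
    # Assumption: no userinfo ("user:pass@") in the URL; it would be dropped here.

    # Path, query and fragment: percent-encode as UTF-8.
    # "%" stays in the safe set so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%/?+:;,")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

With this variant, urllib.request.urlopen(iri_to_uri(href)) receives a pure-ASCII URL, and URLs that were already ASCII pass through unchanged (apart from the host being lowercased).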
