
I am trying to get the original url from requests. Here is what I have so far:

res = requests.get(...)
url = urllib.unquote(res.url).decode('utf8') 

I then get an error that says:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)

The original url I requested is:

https://www.microsoft.com/de-at/store/movies/american-pie-pr\xc3\xa4sentiert-nackte-tatsachen/8d6kgwzl63ql

And here is what happens when I try printing:

>>> print '111', res.url
111 https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '222', urllib.unquote( res.url )
222 https://www.microsoft.com/de-at/store/movies/american-pie-prÃ¤sentiert-nackte-tatsachen/8d6kgwzl63ql
>>> print '333', urllib.unquote(res.url).decode('utf8') 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-61: ordinal not in range(128)

Why is this occurring, and how would I fix this?

David542

1 Answer

UnicodeEncodeError: 'ascii' codec can't encode characters

You are trying to decode a string that is already Unicode. On Python 3 this raises AttributeError, because Unicode strings have no .decode() method there. On Python 2, calling .decode('utf8') on a Unicode string first implicitly encodes it to bytes using sys.getdefaultencoding() ('ascii'), and that hidden encoding step is what raises UnicodeEncodeError.
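To make the hidden step visible, here is a minimal sketch that performs the ASCII encoding explicitly. It is written so that it also runs on Python 3, where nothing is encoded implicitly. The value of `unquoted` is an assumption: it is what the unquoted URL would look like if it held two Latin-1 characters in place of "ä", which is what "position 60-61" in the error message suggests. The helper name `implicit_ascii_encode` is made up for illustration.

```python
# Assumed value: the unquoted URL as a Unicode string in which the two
# UTF-8 bytes of "ä" appear as two separate characters (U+00C3, U+00A4).
unquoted = (u"https://www.microsoft.com/de-at/store/movies/"
            u"american-pie-pr\u00c3\u00a4sentiert-nackte-tatsachen/8d6kgwzl63ql")


def implicit_ascii_encode(text):
    """Do explicitly what Python 2 does before unicode.decode():
    encode the text with the default 'ascii' codec."""
    return text.encode("ascii")


try:
    implicit_ascii_encode(unquoted)
    error = None
except UnicodeEncodeError as exc:
    # The failure comes from encoding, not from decoding UTF-8.
    error = exc

print(error)
```

The exception is raised by the encode step, before any UTF-8 decoding is attempted, which is why the traceback mentions the 'ascii' codec even though the code only asked for 'utf8'.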

In short, do not call .decode() on Unicode strings. Use this instead:

print urllib.unquote(res.url.encode('ascii')).decode('utf-8')

Without the .decode() call, the code prints bytes (assuming a bytestring is passed to unquote()), which may lead to mojibake if the character encoding used by your environment is not utf-8. To avoid mojibake, always print Unicode rather than bytes, and do not hardcode the character encoding of your environment inside your script. That is why .decode() is necessary here.
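If you want the encode, unquote, decode sequence in one place, a small wrapper along these lines may help. This is only a sketch: the name `unquote_to_text` is hypothetical, and it assumes the URL is ASCII-only percent-encoded text whose escaped bytes are UTF-8 (true for the URL in the question, not guaranteed in general). It uses `urllib.unquote()` on Python 2 and `urllib.parse.unquote()` on Python 3.

```python
try:
    from urllib import unquote as _unquote        # Python 2
    PY2 = True
except ImportError:
    from urllib.parse import unquote as _unquote  # Python 3
    PY2 = False


def unquote_to_text(url):
    """Return the percent-decoded URL as Unicode text.

    Assumes `url` contains only ASCII characters and that its
    percent-escapes encode UTF-8 bytes.
    """
    if PY2:
        # Give unquote() a bytestring, then decode the resulting bytes.
        if isinstance(url, unicode):  # noqa: F821 (only evaluated on Python 2)
            url = url.encode('ascii')
        return _unquote(url).decode('utf-8')
    # Python 3: unquote() decodes the escaped bytes itself.
    return _unquote(url, encoding='utf-8')


decoded = unquote_to_text(
    "https://www.microsoft.com/de-at/store/movies/"
    "american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql")
```

With `res` from the question, the call would be `unquote_to_text(res.url)`. The result is Unicode text on both versions, so it can be printed without assuming anything about the terminal's encoding.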


There is a bug in urllib.unquote() if you pass it a Unicode string:

>>> print urllib.unquote(u'%C3%A4')
Ã¤
>>> print urllib.unquote('%C3%A4') # utf-8 output
ä

Pass bytestrings to unquote() on Python 2.
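The two-character result comes from treating each escaped byte as a character of its own, which amounts to a Latin-1 decode. The following sketch reproduces that effect on Python 3 with `urllib.parse.unquote_to_bytes()`. It imitates the Python 2 behaviour rather than calling the Python 2 function, and the variable names are only for illustration.

```python
from urllib.parse import unquote_to_bytes  # Python 3 only

# The two percent-escapes are the UTF-8 bytes of "ä".
raw = unquote_to_bytes("%C3%A4")

# One character per byte, as Python 2's unquote() does for Unicode input.
as_latin1 = raw.decode("latin-1")

# The bytes interpreted as UTF-8, which is what the URL actually means.
as_utf8 = raw.decode("utf-8")

print(len(as_latin1), len(as_utf8))
```

The Latin-1 reading has two characters where the UTF-8 reading has one, and the mojibake is that difference.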

jfs
  • `type(urllib.unquote("https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql"))` == `<type 'str'>`. I think the problem is with his locale – Alastair McCormack Dec 27 '15 at 09:54
  • Doesn't matter surely. The OP doesn't get an exception from `urllib.unquote( res.url )` and I don't get an exception if I use a Unicode: `urllib.unquote(u"https://www.microsoft.com/de-at/store/movies/american-pie-pr%C3%A4sentiert-nackte-tatsachen/8d6kgwzl63ql")` – Alastair McCormack Dec 27 '15 at 09:57
  • @AlastairMcCormack: there are 3 separate issues here. And the solution that fixes all 3 issues is to `.encode()` and then `.decode()` as shown in the answer. `type(res.url)` is unicode in the question otherwise we won't see `UnicodeEncodeError` and [`urllib.unquote()` is broken for unicode urls](http://bugs.python.org/issue8136). – jfs Dec 27 '15 at 10:05
  • Sorry, you're right. I got the wrong end of the stick :) – Alastair McCormack Dec 27 '15 at 10:23
  • @AlastairMcCormack: don't apologize for this, I'm grateful for the feedback. Everybody makes mistakes, [here's my recent brain failure.](http://stackoverflow.com/a/33641463/4279) – jfs Dec 27 '15 at 10:35