I have a Unicode string I'm retrieving from a web service in python.

I need to access a URL I've parsed from this string, that includes various diacritics.

However, if I pass the unicode string to urllib2, it produces a Unicode encoding error. The exact same string, written as a "raw" string literal r"some string", works properly.

How can I get the raw binary representation of a unicode string in python, without converting it to the system locale?

I've been through the Python docs, and everything seems to come back to the codecs module. However, the documentation for the codecs module is sparse at best, and the whole thing seems to be extremely file-oriented.

I'm on Windows, if it's important.

Fake Name
  • 1
    Pro tip: when encountering a Python error, include the *full* traceback and error message in your question. That reduces the amount of guessing we have to do *dramatically*. – Martijn Pieters Dec 28 '12 at 09:24
  • 1
    I want to know *exactly* what is causing the error. You may know, but I am not so sure. Convince me. – Martijn Pieters Dec 28 '12 at 12:00
  • 1
    No, you misunderstand. Python will, by default, encode to ASCII any unicode value you are trying to treat as raw bytes, by concatenating it with other bytes, for example. The work-around is to *encode* it to bytes first. The proper encoding depends on the context; for the domain name it's IDNA, for path parts it's UTF-8, etc. – Martijn Pieters Dec 28 '12 at 12:11
  • Really, really, really read the articles I linked before you continue. – Martijn Pieters Dec 28 '12 at 12:11
  • I had read the Python Unicode HOWTO before I even posted my question. – Fake Name Dec 28 '12 at 12:13
  • I'm sorry, but you are not showing that you are understanding the differences. Have you read the Joel on Software article? I cannot stress enough that you need to understand what encoding really means, and that you need to understand that you *have* to pick an encoding in order to use files or network services, and that there is no such thing as raw unicode bytes. There is *always* an encoding. – Martijn Pieters Dec 28 '12 at 12:16
  • Oh crap, I'm making a prat of myself. I keep saying unicode, and thinking UTF-8. I'm an idiot. WTF. Sorry – Fake Name Dec 28 '12 at 12:20
  • On the other hand, I think I just solved my problem, at least in part. The whole thing only works because the text file was saved as UTF-8. – Fake Name Dec 28 '12 at 12:20
  • @MartijnPieters - Also, I apologise for getting snippy. – Fake Name Dec 28 '12 at 12:21
  • No problem! Glad you worked out your problem! – Martijn Pieters Dec 28 '12 at 12:26
  • @MartijnPieters - On the other hand, I think I've found a bug in `urllib2`. In a file saved with UTF-8 encoding: `urllib2.urlopen("http://✪df.ws" )` works, and `urllib2.urlopen(u"http://✪df.ws")` does not. – Fake Name Dec 28 '12 at 12:28
  • Unicode values for URLs are *not supported*, encode first. The particular example you give uses a unicode character in the domain name, and depending on the source encoding you specified for the python file the raw string you provide could have *any* encoding; in a terminal it depends on the terminal encoding. And the `r` is redundant, it's just another way of spelling out a bytestring where the backslash doesn't form an escape code. – Martijn Pieters Dec 28 '12 at 12:29
  • @MartijnPieters - They're not? That's not mentioned anywhere in the python docs, at least. Assuming UTF-8 urls are supported (and they seem to work, so I think that's a safe assumption), it seems a bit odd for a library to not accept an object containing a valid URL string, when it just needs to be encoded as UTF-8. – Fake Name Dec 28 '12 at 12:32
  • Are there unicode characters that don't fit in UTF-8? Edit: not according to wikipedia. – Fake Name Dec 28 '12 at 12:33
  • Note that the `unicode` type is a relatively late addition to Python, so a lot of documentation doesn't mention it explicitly. If you get UnicodeDecodeError exceptions, it's a safe bet that it's not supported. There are code points in Unicode that UTF-8 is not *supposed* to encode (notably the UTF-16 surrogate halves U+D800 through U+DFFF), and code points over U+10FFFF are officially off-limits to UTF-8. Python can only deal with unicode points up to U+10FFFF in any case, so that's not going to be a problem. – Martijn Pieters Dec 28 '12 at 12:37
  • @MartijnPieters - I see. Anyways, it seems I was bitten by a confluence of factors. Mostly, my test harnesses were producing weird results because they were *already* UTF-8 encoded, as that is the file encoding I was using. That and the fact that since I'm on Windows, I can't print *any* characters to the terminal that are not straight ASCII, and I wound up all twisted in knots trying to figure out where my encoding was changing. – Fake Name Dec 28 '12 at 12:43
  • I think I'll see if I can hack unicode url support into urllib, and bug the python people to see if they're interested. – Fake Name Dec 28 '12 at 12:43
  • See [How to display utf-8 in windows console](http://stackoverflow.com/q/3578685) on how to print UTF-8 to the console. – Martijn Pieters Dec 28 '12 at 12:44
  • I doubt the python people are interested in patching up `urllib` in Python 2.x. 2.7 is the end of the line for Python 2.x, and `urllib` has been refactored in Python 3 already. – Martijn Pieters Dec 28 '12 at 12:45
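
The limits discussed in the comments above (surrogate halves, and the U+10FFFF ceiling) can be checked directly. This is a minimal sketch that assumes Python 3, where `str` is the Unicode type; the helper name `can_encode_utf8` is made up for illustration and is not part of any library.

```python
def can_encode_utf8(text):
    """Return True if text can be encoded as UTF-8, False otherwise."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The star from the example URL (U+272A) and the highest code point
# (U+10FFFF) both encode without trouble.
star_ok = can_encode_utf8("\u272adf.ws")
highest_ok = can_encode_utf8("\U0010FFFF")

# A lone UTF-16 surrogate half is rejected by the UTF-8 codec.
surrogate_ok = can_encode_utf8("\ud800")

# Code points above U+10FFFF cannot even be constructed.
try:
    chr(0x110000)
    beyond_limit_ok = True
except ValueError:
    beyond_limit_ok = False

print(star_ok, highest_ok, surrogate_ok, beyond_limit_ok)
```

So for any text that is valid Unicode, UTF-8 has a representation; the failures only show up for values that are not real characters in the first place.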

1 Answer

You need to encode the URL from unicode to a bytestring. u'' and r'' produce two different kinds of objects: a unicode string and a bytestring, respectively.

You can encode a unicode string to a bytestring with the .encode() method, but you need to know which encoding to use. For URLs, UTF-8 is usually the right choice, but you also need to percent-escape the bytes to fit the URL scheme:

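import urlparse, urllib

# url is the unicode URL; index 2 of the split result is the path
parts = list(urlparse.urlsplit(url))
# encode the path to UTF-8 bytes, then percent-escape those bytes
parts[2] = urllib.quote(parts[2].encode('utf8'))
url = urlparse.urlunsplit(parts)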

The above example is based on an educated guess that the problem you are facing is due to non-ASCII characters in the path part of the URL, but without further details from you it has to remain a guess.
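
For readers on Python 3, where `urlparse` and `urllib.quote` moved into `urllib.parse`, the same idea might look like the sketch below. The function name `quote_path` and the example URL are assumptions for illustration only.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_path(url):
    """Percent-encode the path of url as UTF-8, leaving other parts alone."""
    parts = list(urlsplit(url))
    # quote() encodes str input as UTF-8 by default and keeps "/" unescaped.
    parts[2] = quote(parts[2])
    return urlunsplit(parts)


print(quote_path("http://example.com/caf\u00e9/men\u00fc.html"))
# http://example.com/caf%C3%A9/men%C3%BC.html
```

Note that this assumes the path has not been percent-escaped already; running it on an escaped path would escape the `%` signs a second time.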

For domain names, you need to apply the IDNA encoding (RFC 3490):

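# index 1 of the split result is the network location (the domain name)
parts = list(urlparse.urlsplit(url))
parts[1] = parts[1].encode('idna')
# encode any remaining unicode parts so that every part is a bytestring
parts = [p.encode('utf8') if isinstance(p, unicode) else p for p in parts]
url = urlparse.urlunsplit(parts)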
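
Combining both steps, a Python 3 version could be sketched roughly as follows. It assumes the built-in `idna` codec is acceptable for the host name, and it deliberately ignores any `user:password@` prefix in the URL; `to_ascii_url` is a hypothetical helper name, not a standard-library function.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with an IDNA-encoded host and a percent-encoded path."""
    parts = urlsplit(url)
    # .hostname drops the port (and any userinfo) and lowercases the host.
    host = parts.hostname.encode("idna").decode("ascii")
    # Re-attach the port if one was given.
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)
    # Percent-encode the path as UTF-8; query and fragment pass through as-is.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(to_ascii_url("http://b\u00fccher.example/k\u00e4se"))
# http://xn--bcher-kva.example/k%C3%A4se
```

The point of splitting first is that each component gets the encoding its context requires: IDNA for the host, UTF-8 plus percent-escaping for the path.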

See the Python Unicode HOWTO for more information. I also strongly recommend you read the Joel on Software Unicode article as a good primer on the subject of encodings.

Martijn Pieters
  • You *really* need to read the Unicode howto and the Joel on Software article. *There are no underlying bytes, only character codepoints*. You *have* to encode to bytes using the `.encode()` method. – Martijn Pieters Dec 28 '12 at 12:08
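
To illustrate the point that there is always an encoding, here is a small sketch (assuming Python 3; the sample text is arbitrary) showing that the same characters turn into different bytes depending on the codec, and that ASCII cannot represent them at all:

```python
text = "caf\u00e9"  # four code points; the last one is U+00E9

# The same text yields different byte sequences under different encodings.
for codec in ("utf-8", "latin-1", "utf-16-le"):
    print(codec, text.encode(codec))

# ASCII has no byte for U+00E9, so encoding fails. Python 2 applies this
# ASCII encoding implicitly when unicode is treated as bytes.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

print(type(ascii_error).__name__)
```

None of these byte sequences is "the raw one"; each is just the result of picking a particular encoding.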