12

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.

Is there any way to tell python - what I thought was an up-to-date & modern language to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?

I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...

For example:

url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

Update If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "classes.py", line 583, in <module>
    wiki.getPage(title)
  File "classes.py", line 146, in getPage
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'

I'm not manually typing in any strings; I'm parsing HTML and JSON from websites. So the strings/bytestreams/whatever they are, are all created by python.

Update 2 I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but looks like after 3 days of no luck I'll just try a different language. It's a shame, python is preinstalled on osx. I've marked correct the answer that fixed the one instance of the error I posted.

Justin808
  • FYI, I just posted a related question that drills down into an aspect of this one: http://stackoverflow.com/questions/12557447/how-can-you-make-python-2-x-warn-when-coercing-strings-to-unicode – Mu Mind Sep 24 '12 at 00:47
  • 6
    Please read http://www.joelonsoftware.com/articles/Unicode.html . Now. A person won't be able to make a working program using _text_ at all nevermind dealing properly with encoding conversions if he doesn't understand at least what is in this article. From your question wording it is clear you are making blind attempts. – jsbueno Sep 24 '12 at 02:04
  • 2
    @jsbueno - I know what unicode is, I know how it works. Python fubard it up to the point where you have to make blind attempts to use it at all. – Justin808 Sep 24 '12 at 02:53
  • No you don't. Python's way of using it is quite sane if you _understand_ how it works, as it is nicely explained in the above link. – jsbueno Sep 24 '12 at 04:26
  • BTW, please do not take that as an offensive comment. Just read the article and you will be more confident, not only with the task at hand and not just when dealing with Python text issues. – jsbueno Sep 24 '12 at 04:28

5 Answers

20

This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:

from __future__ import unicode_literals

This will at least ensure that your own literal strings default to unicode rather than str.

ShankarG
18

There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.

Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded str (which implicitly decodes it as ASCII first, so it either fails or quietly does nothing useful). As a result, a lot of Python code and Python libraries out there never state what encoding they work with, yet still assume some particular one, since the str type leaves managing the encoding to the programmer. And you have to think about the encoding each time you use these libraries, since they don't support the unicode type themselves.
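
One way to keep this straight, as a minimal sketch rather than a definitive recipe: decode byte strings to unicode as soon as they enter the program, and encode only where a bytes-only API needs them. The helper names to_text and to_bytes are hypothetical (they come from no library), and UTF-8 is an assumption about the incoming data.

```python
# Hypothetical helpers; the names and the UTF-8 default are assumptions.
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding unicode strings."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# Mixed input, as it might arrive from different sources:
titles = [u'Espa\xf1a', b'Espa\xc3\xb1a']
normalized = [to_text(t) for t in titles]  # all unicode from here on
```

With both inputs normalized at the boundary, the rest of the code only ever sees unicode, and to_bytes is called once, right before handing data to something that wants bytes.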


In your particular case, the first error tells you that title was already an encoded UTF-8 byte string when you called .encode('utf-8') on it: Python 2 first tries to decode the str with the ASCII codec, and that is what raises the UnicodeDecodeError. The 2nd error tells you that title was an UNencoded unicode string, which urllib2.quote can't handle. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:

encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
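
As a rough sketch of how that guard could fit into the URL construction from the question: build_search_url and its default items=10 are hypothetical names chosen for illustration, and the try/except import is only there so the snippet also runs on Python 3 (on Python 2, urllib2.quote is the same function as urllib.quote).

```python
try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def build_search_url(title, items=10):
    """Build the search URL; title may be unicode or UTF-8 bytes.

    Hypothetical helper: the name and the default for items are assumptions.
    """
    encoded_title = title
    if isinstance(encoded_title, type(u'')):  # i.e. `unicode` on Python 2
        encoded_title = encoded_title.encode('utf-8')
    return ("http://en.wikipedia.org/w/api.php?action=query&list=search"
            "&format=json&srlimit=" + str(items) +
            "&srsearch=" + quote(encoded_title))
```

Either kind of title then yields the same percent-encoded query string, which is why this works as a short-term fix even while the mixed data is still around.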

If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:

python -Werror -municodenazi myprog.py

This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot this exception way down the road from the actual problem. See my answer on this related question for details.

Mu Mind
  • 1
    Ah well, this worked in the one method, but just moved the error along to another spot. Guess I'll just rewrite everything in another language. I had hoped python would be a useful scripting tool, 3 days later, nope. – Justin808 Sep 24 '12 at 02:48
  • If this gets rid of your error, great! That confirms that your problem is unicode strings getting mixed in with non-unicode. That bad data is still out there somewhere, and the other error is most likely just another symptom of the same original problem. I just updated the answer to mention the unicode-nazi tool, which should let you easily suss out the root problem. – Mu Mind Sep 24 '12 at 03:22
  • Looks like `HTMLParser` doesn't do unicode? I just don't understand why it has to be so very hard to deal with unicode. It should be 100% hidden from the developer, it's low level stuff. Heck even obj-c hides it away and everything just works. – Justin808 Sep 24 '12 at 03:47
  • The `isinstance` thing isn't working for `def handle_data(self, data):` the data returned from the `HTMLParser` class either. – Justin808 Sep 24 '12 at 03:49
  • It _should_ be hidden from the developer in most of the cases you're dealing with, but some people had decided that it would be more important for these things to be fast than correct, and the process to change core libraries is so slow that it's taken this long before python 3 got anything done about it. – Mu Mind Sep 24 '12 at 04:02
  • If you could update your question to include what you're seeing now (particularly any specific problematic strings you can post or error messages), or post them as a separate question perhaps, I'd be happy to keep helping you hunt this down... – Mu Mind Sep 24 '12 at 04:04
  • I just did some experimenting and found that HTMLParser works fine with both unicode strings and encoded UTF-8 strings (`handle_data` gets a `unicode` type in the first case, `str` in the second), but fails if you feed the same parser first a unicode string, then a UTF-8 string. I should mention that in this case, you want to *decode* if it's *not* unicode, which is the inverse of the trick you use for `urllib.quote`, and you should probably do it *before* passing to `HTMLParser.feed`. Then you should still *encode* before `urllib.quote`, but you can make it unconditional now. – Mu Mind Sep 24 '12 at 04:19
  • I think everything is working... no errors at least. I think some of my unicode is transforming into ASCII (accented a to just an a) but I don't think that's possible so I've got to see if some bad data is slipping in from one of my sources. The last errors I had were due to `print`. I finally gave up and just opened a file to write directly to it. – Justin808 Sep 24 '12 at 18:13
3

Yes, define your unicode data as unicode literals:

>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'

You usually want to either use '\uxxxx' unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:

# -*- coding: utf-8 -*-

Read the Python Unicode HOWTO for the details, such as default encodings (the default source code encoding, for example, is ASCII).

As for your specific example, your title is not a unicode string but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails, as the default codec for such implicit conversions is ASCII:

>>> 'å'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:

>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
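
The same round trip can be sketched with an explicit byte literal, so it does not depend on the terminal or source encoding. The variable names are only illustrative; the explicit 'ascii' decode mirrors what Python 2 attempts implicitly when .encode() is called on a byte string.

```python
raw = b'\xc3\xa5'            # the UTF-8 bytes for u'\xe5' (å)

# Decoding with the right codec gives a one-character unicode string.
text = raw.decode('utf-8')

# Encoding it again gives back the original two bytes.
round_tripped = text.encode('utf-8')

# Decoding as ASCII fails on the first byte, because it is >= 0x80.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc
```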

If you are used to Python 3: the unicode type of Python 2 (u'' literals) is the default string type, str, in Python 3, while regular (byte) strings in Python 2 ('') correspond to bytes objects in Python 3 (b'').

If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:

if isinstance(title, unicode):
    title = title.encode('utf-8')

You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.

Martijn Pieters
2

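Be sure that title in your title.encode("utf-8") is of type unicode, and don't use str("İŞşĞğÖöÜü").

Use unicode("ĞğıIİiÖöŞşcçÇ", "utf-8") in your stringifiers; without the explicit encoding argument, Python 2 decodes the byte string as ASCII and fails on these characters.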

yet
2

Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.

Unfortunately, not many libraries have been written for Python 3 yet, and there are some basic differences in syntax and keyword use. That's the problem I have: the libraries I need are only available for Python 2.7, and I don't know enough to convert them to Python 3. :(

Deina Underhill