I'm writing scripts to clean up Unicode text files (stored as UTF-8), and I chose Python 3.x (3.2) over the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but print(), at least, still does not seem to default to UTF-8. If I try to print a string (msg below is a str) that contains special characters, I still get a UnicodeEncodeError like this:

print(label, msg)
... in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position 38: character maps to <undefined>

If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:

print(label, msg.encode())

This also works for printing objects or lists containing Unicode strings (something I often have to do when debugging), since I can call str() on them and then encode() the result. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj)? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.

Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)

Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*

However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())

b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'

I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.

Jon Coombs
  • If you're running on Windows, see http://stackoverflow.com/questions/4942305/why-dont-scripting-languages-output-unicode-to-the-windows-console – dan04 Aug 16 '12 at 13:29
  • Python 3 does not default to UTF-8. It does default to Unicode, but that is a quite different beast. Read or watch Ned Batchelder's awesome [Pragmatic Unicode](http://bit.ly/unipain). –  Aug 17 '12 at 02:15
  • Just to clarify, I wasn't claiming that Python defaults to UTF-8 (since internal representations are not 'encoded'), but I was assuming that since encode() does, that print() does as well. Thanks to thg435 for clarifying that for print() it depends on the output device. – Jon Coombs Dec 13 '12 at 16:26

2 Answers

print doesn't default to any encoding; it uses whatever encoding the output device (such as a console) claims to support. Your console encoding appears to be a non-Unicode one (the 'charmap' codec in the traceback points to a legacy 8-bit code page), so print tries to encode your Unicode strings in that encoding and fails. The easiest way around this is to tell the console to use UTF-8 (for example, export LC_ALL=en_US.UTF-8 on Unix systems).
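
If changing the console encoding is not an option, one possible workaround is a small wrapper that escapes whatever the output stream cannot represent instead of raising. The following is only a sketch: the names to_printable and safe_print are hypothetical helpers, not part of Python, and the fallback to UTF-8 for streams that report no encoding is an assumption.

```python
import sys


def to_printable(text, encoding):
    """Escape characters that `encoding` cannot represent as backslash sequences."""
    # backslashreplace swaps unencodable characters for escapes instead of raising.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


def safe_print(*args, **kwargs):
    """Hypothetical print() wrapper that escapes unencodable characters."""
    stream = kwargs.get("file") or sys.stdout
    # Assumption: a stream that reports no encoding accepts any text.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # sep, end, file and flush are passed through to print() unchanged.
    print(*(to_printable(str(arg), encoding) for arg in args), **kwargs)


if __name__ == "__main__":
    print("stdout encoding:", sys.stdout.encoding)
    safe_print("label", "Devanagari digit two: \u0968")
```

On a UTF-8 console this prints the text unchanged; on a legacy code page the unsupported character shows up as an escape such as \u0968, and the rest of the line (including regex backslashes) stays as readable as with a plain print().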

georg
  • Thanks; that helps. And thanks to whoever linked to the very relevant topic, "Why don't scripting languages output Unicode to the Windows console?". I had searched before asking this question but hadn't thought to search on "Windows console". – Jon Coombs Aug 17 '12 at 07:31

The easier way to proceed is to use only Unicode strings inside your script, and to use encoded data only when you interact with the "outside" world, that is, when you have input to decode or output to encode.

To do so, every time you read something, use decode; every time you output something, use encode.
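
As a rough illustration of that pattern, the sketch below decodes at the input boundary, works on str in the middle, and encodes at the output boundary. The function name clean_bytes and the whitespace-collapsing step are made up for the example; they stand in for whatever cleanup the script actually does.

```python
def clean_bytes(raw, encoding="utf-8"):
    """Decode at the input boundary, work on str, encode at the output boundary."""
    text = raw.decode(encoding)        # bytes -> str as soon as data comes in
    cleaned = " ".join(text.split())   # placeholder cleanup: collapse whitespace
    return cleaned.encode(encoding)    # str -> bytes only when data goes out


if __name__ == "__main__":
    raw = "caf\u00e9   \u0968\n".encode("utf-8")
    print(clean_bytes(raw))
```

In Python 3, opening a file in text mode with open(path, encoding="utf-8") performs the same decode and encode steps automatically, so explicit calls like these are mainly needed when dealing with raw bytes.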

For your regexes, use the re.UNICODE flag.
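
A short sketch of what the flag affects. Note that in Python 3, patterns given as str already use Unicode matching by default, so re.UNICODE is redundant there and re.ASCII is what restricts \w, \d and \b to ASCII; the sample text is made up for illustration.

```python
import re

# Sample text: "café" followed by the Devanagari digit two (U+0968).
text = "caf\u00e9 \u0968"

# Default for str patterns in Python 3: \w and \d follow Unicode rules.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# re.UNICODE is accepted with str patterns but changes nothing in Python 3.
explicit_words = re.findall(r"\w+", text, re.UNICODE)

# re.ASCII restricts \w and \d to ASCII, similar to Python 2's default.
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d", text, re.ASCII)

if __name__ == "__main__":
    # ascii() keeps the output printable on consoles that lack these characters.
    print(ascii(unicode_words), ascii(ascii_words))
```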

I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.

Thomas Orozco
  • I'm not sure what you mean by "easier", since I was saying it would be easier to use print without encode than to use both. Also, I thought I was already "only using unicode" in my script and files. Of course, the data must be "encoded" while it's on disk. I read the files in as "utf-8" so I don't think I need to use decode, but so far I'm not seeing a way around using encode with every print statement. I'll look into the re.UNICODE flag. – Jon Coombs Aug 16 '12 at 10:02
  • I suggested that you might want to "use only unicode" because your regexes seem not to be using unicode, which indicates that you might not really be "using only unicode", although you might think so. – Thomas Orozco Aug 16 '12 at 12:37
  • No, my regexes are all using either just plain text (which is valid unicode) or else include special characters directly. They're all saved in a UTF-8 text file. However, I'm suspecting that it might be better to escape them with the regex backslash notation; I'm not sure. – Jon Coombs Aug 17 '12 at 07:21
  • It's just that if you don't use the re.UNICODE flag and you try to apply your regex (the re object, not the string from which it was constructed) to a unicode string, you can get funky results. For instance, some characters might not get matched when you'd expect them to. – Thomas Orozco Aug 17 '12 at 08:00
  • +1 on this answer because it seems to be the correct answer for Python 2.x, though not for Python 3.x. And for the helpful comment about re.UNICODE, or (?u), because that will help me with some other issues (getting \w and \b to work properly with unicode). – Jon Coombs Dec 13 '12 at 16:37