
Why does Python add \xe3 in the output of:

>>> b'Transa\xc3\xa7\xc3\xa3o'.decode('utf-8')
'Transaç\xe3o'

Expected value is:

'Transação'

Some more information about my environment:

>>> import sys
>>> print(sys.version)
3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)]   
>>> sys.stdout.encoding
'cp437'

This was run under Console 2 with PowerShell.

Martijn Pieters
Igor Gatis

1 Answer


You need to use a console or terminal that supports all of the characters that you want to print.

When the interactive interpreter echoes a result, the repr() of the value is encoded with the codec of your console. Any character that codec does not support is written using the backslashreplace error handler, which keeps the output readable rather than raising an exception. This is a feature of the default sys.displayhook() function:

If repr(value) is not encodable to sys.stdout.encoding with sys.stdout.errors error handler (which is probably 'strict'), encode it to sys.stdout.encoding with 'backslashreplace' error handler.
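The fallback described in that quote can be approximated in a few lines. This is only a rough sketch, not the real implementation: `echo_like_displayhook` is a made-up helper name, and it assumes the first attempt uses the 'strict' error handler (the real hook uses whatever `sys.stdout.errors` is set to).

```python
def echo_like_displayhook(value, encoding="cp437"):
    """Approximate how the REPL would render `value` on a console using `encoding`.

    Hypothetical helper for illustration; assumes 'strict' for the first attempt.
    """
    rendered = repr(value)
    try:
        data = rendered.encode(encoding, "strict")
    except UnicodeEncodeError:
        # Unsupported characters become \xNN / \uNNNN escapes instead of failing.
        data = rendered.encode(encoding, "backslashreplace")
    # Decode again so we can inspect what the console would be asked to show.
    return data.decode(encoding)


text = b"Transa\xc3\xa7\xc3\xa3o".decode("utf-8")

# With cp437, the cedilla survives but the a-tilde is replaced by the
# four ASCII characters \ x e 3, matching the output in the question.
shown_cp437 = echo_like_displayhook(text, "cp437")

# With a codec that covers both characters, nothing is escaped.
shown_utf8 = echo_like_displayhook(text, "utf-8")
```

Note that the decoded string itself is correct in both cases; only its echoed representation differs.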

Your console can handle ç but not ã. Several codecs include the first character but not the second; you are using IBM codepage 437, but it is by no means the only one.
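If you want to check a given codec yourself, something like the following should work. The list of candidate codecs is an arbitrary selection for illustration, and `can_encode` is a hypothetical helper, not a standard library function.

```python
# Arbitrary sample of codec names to test; extend as needed.
CANDIDATES = ["cp437", "cp850", "cp1252", "latin-1", "utf-8"]


def can_encode(ch, encoding):
    """Return True if `ch` can be encoded with `encoding` without errors."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


first, second = "\u00e7", "\u00e3"  # c-cedilla and a-tilde

# Codecs that support the first character but not the second.
partial = [
    name
    for name in CANDIDATES
    if can_encode(first, name) and not can_encode(second, name)
]
print(partial)
```

cp437 will show up in the printed list; codecs that cover both characters, such as UTF-8, will not.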

If you are running Python in the standard Windows console (cmd.exe), be aware that Python, Unicode and that console do not mix very well. You can install the win-unicode-console package to make Python 3 use the Windows APIs to output Unicode text more reliably; you will still need a console font capable of displaying the characters you print.

I don't know for certain if that package is compatible with other Windows shells; your mileage may vary.

Martijn Pieters
  • From the look of the version string, the OP is running on ms-windows, probably in `cmd.exe`. Not the best terminal out there. Out of the box, it will most definitely *not* be using UTF-8. How to change that has been answered several times on SO. – Roland Smith Jun 08 '15 at 19:34
  • Windows console allows you to print Unicode characters that are unsupported by the current code page. `win-unicode-console` that you've mentioned does it. See [python3 print unicode to windows xp console encode cp437](http://stackoverflow.com/q/28521944/4279) – jfs Jun 09 '15 at 17:59
  • you should probably mention that the "backslashreplace" behavior if true is specific to `repr()` (used by `sys.displayhook()` in REPL) – jfs Jun 09 '15 at 18:01
  • @J.F.Sebastian: Yes, this is `sys.displayhook()` specific; reference added. – Martijn Pieters Jun 09 '15 at 18:07