I want to write a non-ASCII character, let's say →, to standard output. The tricky part seems to be that some of the data that I want to concatenate to that string is read from JSON. Consider the following simple JSON document:

{"foo":"bar"}

I include this because if I just want to print → then it seems enough to simply write:

print("→")

and it will do the right thing in python2 and python3.

So I want to print the value of foo together with my non-ASCII character →. The only way I found to do this such that it works in both python2 and python3 is:

getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))

or

getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))

It is important not to miss the u in front of "→", because otherwise a UnicodeDecodeError will be thrown by python2.

Using the print function like this:

print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))

doesn't seem to work because python3 will complain TypeError: 'str' does not support the buffer interface.

Did I find the best way or is there a better option? Can I make the print function work?

josch
  • So `print(data['foo'] + u'→')` doesn't work? – user2357112 May 30 '14 at 00:12
  • @user2357112: Not on my machine. – Dair May 30 '14 at 00:28
  • For your last example that calls `print`, in Python 3 encoding the string returns `bytes`. Since `print` requires a string, it calls the `__str__` method, which for `bytes` just returns a repr, i.e. `str("→".encode()) == "b'\\xe2\\x86\\x92'"`. Next `print` writes this useless repr to the `file`, but the `BufferedWriter` requires an object that supports the buffer interface, such as `bytes`. – Eryk Sun May 30 '14 at 02:13
  • @eryksun thank you! As `print()` is able to print all kinds of datatypes without explicit conversion to `str`, I didn't think it would choke on `bytes`. – josch May 30 '14 at 06:37
  • Printing has to first get an object as a string. This doesn't choke on Python 3 `bytes`. Decoding `bytes` using a default encoding would be wrong in general, since a `bytes` object isn't necessarily text. I just meant the repr string is "useless" for your needs. What choked is trying to print to a `BufferedWriter`, e.g. `print('abc', file=sys.stdout.buffer)`. – Eryk Sun May 30 '14 at 07:16
  • You can try putting this at the top of your script: `# coding=utf8` – Sebastián Olate Bustamante Jun 06 '14 at 16:00
  • @EstebanOlate: That won't ever fix unicode problems with printing. Please don't cargo cult the source encoding hint when you don't understand what it does. – Martijn Pieters Jun 07 '14 at 01:43
  • @MartijnPieters: Wait, what? I keep getting notifications... I think you're confusing me for josch? :) – Dair Jun 07 '14 at 01:46
  • @anon: ick, I am. Re-directing the comments. – Martijn Pieters Jun 07 '14 at 01:49
  • @anon: that said, why do *you* claim `print(data['foo'] + u'→')` doesn't work on your machine? **That works perfectly fine** in a properly configured environment. – Martijn Pieters Jun 07 '14 at 01:50
  • @MartijnPieters: I have a deleted answer. It doesn't work unless I put the proper encoding: `# -*- coding: utf-8 -*-`. – Dair Jun 07 '14 at 01:53
  • What JSON library are you using? What *full* traceback do you get when you use `print(data['foo'] + u'→')`? There should be **no need** to go to these lengths; Python is perfectly capable of printing Unicode to a properly configured terminal or console. – Martijn Pieters Jun 07 '14 at 01:53
  • @anon: sure, you need to make sure that the `u'→'` string literal is correctly interpreted. – Martijn Pieters Jun 07 '14 at 01:53
  • @MartijnPieters: He didn't have it in his question, so I thought it might have been a probable cause, but it wasn't so I deleted it. – Dair Jun 07 '14 at 01:55
  • @MartijnPieters `print(data['foo'] + u'→')` doesn't work on Windows. This has everything to do with `sys.stdout.encoding` and the terminal/shell you're trying to print to. – snapshoe Jun 08 '14 at 14:40
  • @snapshoe: That is a **different** issue, and a duplicate question if that is the case here. No amount of raw UTF-8 writing will fix that issue either. – Martijn Pieters Jun 08 '14 at 14:41
  • @snapshoe: this is why I explicitly state that things work for *a properly configured terminal or console*, to address exactly that issue. – Martijn Pieters Jun 08 '14 at 14:54
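
The behaviour Eryk Sun describes in the comments above can be reproduced with a small sketch. It assumes Python 3; `io.BytesIO` stands in for `sys.stdout.buffer`, and the helper name `print_to_binary` is hypothetical. The escape `\u2192` is used for "→" so that the snippet does not depend on the source file's encoding.

```python
import io

# Encoding text gives bytes; str() of bytes on Python 3 is just its repr.
arrow_bytes = u"\u2192".encode("utf-8")
text_form = str(arrow_bytes)  # the repr "b'\\xe2\\x86\\x92'", not the arrow


def print_to_binary(text):
    """Try to print text to a binary stream; return the error, if any."""
    binary = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
    try:
        print(text, file=binary)
    except TypeError as exc:
        # Binary streams only accept bytes-like objects, not str.
        return exc
    return None


error = print_to_binary("abc")
```

Here `error` ends up holding a `TypeError`: `print` always hands `str` objects to the file's `write` method, which is why it cannot target a binary stream directly.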

2 Answers

The most concise I could come up with is the following, which you may be able to make more concise with a few convenience functions (or even replacing/overriding the print function):

# -*- coding=utf-8 -*-
import codecs
import os
import sys

# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')

if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(output.encode('utf-8'))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)

The best option is using the -*- coding line, which allows you to use the actual character in the file. But if for some reason you can't use the coding line, it's still possible to accomplish this without it.

This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1. It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)

LANG=zh_CN python test.py

It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
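
One plausible explanation for the 'ΓåÆ' output, assuming the cmd console was using code page 437 (an assumption; the code page is not stated here): the three UTF-8 bytes of '→' are each shown as a separate cp437 character. A minimal sketch of that effect:

```python
# Assumption: the console decodes output as code page 437.
arrow = u"\u2192"                      # the arrow character
utf8_bytes = arrow.encode("utf-8")     # three bytes: e2 86 92
mojibake = utf8_bytes.decode("cp437")  # each byte read as one cp437 character
```

Under that assumption `mojibake` is the three-character string 'ΓåÆ', which matches what the cmd terminal displayed.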

snapshoe
  • With regard to Windows only sort of working, would changing `'utf-8'` to `sys.stdout.encoding` print any better? – Uyghur Lives Matter Jun 11 '14 at 14:00
  • No. That would be the same as simply doing a print. If you're not changing the encoding, `sys.stdout.encoding` is the one it uses, which is why all the work goes into changing it from its default. – snapshoe Jun 11 '14 at 14:28
  • As an experiment, try the code [here](http://pastebin.com/DM3hX3Yp). It will show the effect of the encoding used on a terminal for all available encodings-- for ones that don't throw exceptions. I ran this on Windows & Linux, 2.7 & 3.4. – snapshoe Jun 11 '14 at 15:40
  • I cannot stress enough how important it is to ensure your terminal or console is correctly configured. It should not be Python's job to ensure this. Personally, I'd use `output = output.encode('utf-8')`, `try:`, `sys.stdout.buffer.write(output)`, `except AttributeError:`, `sys.stdout.write(output)`; `codecs.getwriter()` is overkill here, and you need to test for *features*, not versions. You can use the `io` module in Python 2 as well so `sys.stdout` could actually have the `.buffer` attribute there too. – Martijn Pieters Jun 12 '14 at 08:42
  • @MartijnPieters Is there a tutorial or reference on how to correctly configure a console/terminal (cmd/powershell/other?) on Windows? – snapshoe Jun 12 '14 at 23:10
  • For Python 2 [this post](http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash) is very thorough. I believe that for Python 3, the right code page Just Works. – Martijn Pieters Jun 12 '14 at 23:16
  • For the console itself, see [this post on what codepage and font to use](http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how). – Martijn Pieters Jun 12 '14 at 23:20
  • The real beef here is `codecs.getwriter('utf-8')(sys.stdout)`. It took me a detour through [another question](http://stackoverflow.com/a/1169209/874188) to appreciate this. – tripleee Jan 08 '15 at 07:22
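
The feature test that Martijn Pieters suggests in the comments above could be sketched roughly as follows. The helper name `write_utf8` and its `stream` parameter are hypothetical, added only so the idea can be tried against any stream; only the attribute lookup sits inside the `try`, so an `AttributeError` raised by the write itself is not swallowed.

```python
import sys


def write_utf8(text, stream=None):
    """Write text to stream as UTF-8 bytes (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout
    data = text.encode("utf-8")
    try:
        # Text streams from the io module expose their binary layer here.
        target = stream.buffer
    except AttributeError:
        # Python 2 file objects (and raw binary streams) take bytes directly.
        target = stream
    target.write(data)
```

With this in place, `write_utf8(data["foo"] + u"→" + u"\n")` would send UTF-8 bytes to standard output on both Python 2 and Python 3 without checking `sys.version_info`.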

If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:

# -*- coding=utf-8 -*-
print("bar" + u"→")
Addison
  • This does not work if `sys.stdout.encoding != "UTF-8"`, such as on Windows. – snapshoe Jun 08 '14 at 07:47
  • @snapshoe It is obvious that it will not be _displayed_ properly if the output goes to something with limited capabilities. But Python does write to the output in UTF-8, and the OP wanted to send the output to a file, it seems. – rds Jun 08 '14 at 09:16
  • @rds I don't see any mention of outputting to a file. I do see printing to stdout mentioned everywhere, including in the title of the post. – snapshoe Jun 08 '14 at 14:33