
I need to parse and output some data in a table-like format. The input is Unicode text. Here is the test script:

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890'
print '%5s' % s1
print '%5s' % s2

It works as expected when the script is run directly (./test.py):

1234567890
 abcd
 αβγδ

But if I try to redirect the output to a file (test.py > a.txt), I get an error:

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    print '%5s' % s2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

If I convert the strings to UTF-8, e.g. s2.encode('utf8'), the redirection works fine, but the column positions are broken:

1234567890
 abcd
αβγδ

How can I force it to work properly in both cases?

Abelisto

3 Answers


It boils down to your output stream's encoding. In this particular case, since you're using print, the stream in question is sys.stdout.
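You can check which encoding print will use by inspecting the stream attribute directly (a quick sketch; the value it reports depends entirely on your environment):

```python
import sys

# print encodes unicode output with whatever encoding sys.stdout reports.
# On a UTF-8 terminal this is typically 'UTF-8'; in Python 2, with stdout
# redirected and no PYTHONIOENCODING set, it is None (falling back to ASCII).
print(sys.stdout.encoding)
```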

Interactive mode / stdout not redirected

When you run Python in interactive mode, or when you don't redirect stdout to a file, Python picks the encoding from your environment, namely the locale environment variables like LC_CTYPE. For example, if you run your program like this:

$ LC_CTYPE='en_US' python test.py
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

it will use ANSI_X3.4-1968 (i.e. ASCII) for sys.stdout (see sys.stdout.encoding) and fail. However, if you use UTF-8 (as you apparently already do):

$ LC_CTYPE='en_US.UTF-8' python test.py
1234567890
 abcd
 αβγδ

you'll get the expected output.

stdout redirected to file

When you redirect stdout to a file, Python will not try to detect the encoding from your locale; instead, it checks another environment variable, PYTHONIOENCODING (check the source, initstdio() in Python/pylifecycle.c). For example, this will work as expected:

$ PYTHONIOENCODING=utf-8 python test.py >/tmp/output

since Python will use UTF-8 encoding for /tmp/output file.
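You can verify the effect from within a script by spawning a child interpreter with the variable set and inspecting the raw bytes it writes to its (captured, i.e. redirected) stdout. A sketch, assuming sys.executable points at a Python that accepts the -c flag:

```python
import os
import subprocess
import sys

# Run a child Python with PYTHONIOENCODING forced to UTF-8 and capture
# its redirected stdout as raw bytes.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', "print(u'\\u03b1\\u03b2\\u03b3\\u03b4')"],
    env=env,
)
print(out)  # the UTF-8 bytes for the four Greek letters, plus a newline
```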

Manual stdout encoding override

You can also manually wrap sys.stdout in a stream writer that encodes with the desired encoding (a trick covered in several related SO questions):

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

Now print will correctly output str and unicode objects, since the underlying stream writer will convert them to UTF-8 on the fly.
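To see what the wrapper does in isolation, you can point it at an in-memory byte buffer instead of sys.stdout (a sketch; io.BytesIO stands in for a byte-oriented stream like Python 2's stdout):

```python
import codecs
import io

buf = io.BytesIO()                       # stands in for a byte-oriented stdout
writer = codecs.getwriter('utf8')(buf)   # wraps it with on-the-fly UTF-8 encoding
writer.write(u'%5s\n' % u'\u03b1\u03b2\u03b3\u03b4')
# The unicode string was padded to 5 characters first, then encoded to bytes.
print(buf.getvalue())
```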

Manual string encoding before output

Of course, you can also manually encode each unicode string to a UTF-8 str prior to output:

print ('%5s' % s2).encode('utf8')

but that's tedious and error-prone.
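The order matters because encoding changes the length that %5s sees: the four Greek characters become eight UTF-8 bytes, so a byte string encoded first is already wider than 5 and gets no padding. A quick check:

```python
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print(len(s2))                 # 4 characters -> '%5s' pads with one space
print(len(s2.encode('utf8')))  # 8 bytes -> '%5s' would add no padding at all
```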

Explicit file open

For completeness: when opening files for writing with a specific encoding (like UTF-8) in Python 2, use either io.open or codecs.open, since they let you specify the encoding (unlike the built-in open):

from codecs import open
myfile = open('filename', 'w', encoding='utf-8')

or:

from io import open
myfile = open('filename', 'w', encoding='utf-8')
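Putting it together, a small round-trip sketch (the filename and the single-row table are just illustrative):

```python
import io

s2 = u'\u03b1\u03b2\u03b3\u03b4'

# Write the formatted unicode row out as UTF-8...
with io.open('table.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(u'%5s\n' % s2)

# ...and read it back, decoded transparently to unicode.
with io.open('table.txt', encoding='utf-8') as myfile:
    print(myfile.read())
```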
randomir

You should encode '%5s' % s2, not s2. The following then produces the expected output:

print ('%5s' % s2).encode('utf8')
JuniorCompressor

print '%5s' % s1 works, but print '%5s' % s2 does not when stdout is redirected. It must be print ('%5s' % s2).encode('utf8').

Try this code:

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890' 
print '%5s' % s1
print ('%5s' % s2).encode('utf8')
sameera lakshitha