
I need to parse and output some data in a table-like format. The input is Unicode text. Here is the test script:

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890'
print '%5s' % s1
print '%5s' % s2

It works as expected when the script is run directly (./test.py):

1234567890
 abcd
 αβγδ

But if I try to redirect the output to a file (test.py > a.txt), I get an error:

Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    print '%5s' % s2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

If I convert the strings to UTF-8, e.g. s2.encode('utf8'), the redirection works fine, but the column positions are broken:

1234567890
 abcd
αβγδ

How can I force it to work properly in both cases?

Abelisto

3 Answers


It boils down to your output stream's encoding. In this particular case, since you're using print, the stream in question is sys.stdout.
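You can check which encoding print will use by inspecting the stream attribute directly (a quick sketch; the value it reports depends entirely on your environment):

```python
import sys

# print encodes unicode output with whatever encoding sys.stdout reports.
# On a UTF-8 terminal this is typically 'UTF-8'; in Python 2, with stdout
# redirected and no PYTHONIOENCODING set, it is None (falling back to ASCII).
print(sys.stdout.encoding)
```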

Interactive mode / stdout not redirected

When you run Python in interactive mode, or when you don't redirect stdout to a file, Python picks the encoding from your environment, namely the locale environment variables like LC_CTYPE. For example, if you run your program like this:

$ LC_CTYPE='en_US' python test.py
...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

it will use ANSI_X3.4-1968 (i.e. ASCII) for sys.stdout (see sys.stdout.encoding) and fail. However, if you use UTF-8 (as you apparently already do):

$ LC_CTYPE='en_US.UTF-8' python test.py
1234567890
 abcd
 αβγδ

you'll get the expected output.

stdout redirected to file

When you redirect stdout to a file, Python will not try to detect the encoding from your locale; instead, it checks another environment variable, PYTHONIOENCODING (check the source, initstdio() in Python/pylifecycle.c). For example, this will work as expected:

$ PYTHONIOENCODING=utf-8 python test.py >/tmp/output

since Python will use UTF-8 encoding for /tmp/output file.
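You can verify the effect from within a script by spawning a child interpreter with the variable set and inspecting the raw bytes it writes to its (captured, i.e. redirected) stdout. A sketch, assuming sys.executable points at a Python that accepts the -c flag:

```python
import os
import subprocess
import sys

# Run a child Python with PYTHONIOENCODING forced to UTF-8 and capture
# its redirected stdout as raw bytes.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', "print(u'\\u03b1\\u03b2\\u03b3\\u03b4')"],
    env=env,
)
print(out)  # the UTF-8 bytes for the four Greek letters, plus a newline
```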

Manual stdout encoding override

You can also manually wrap sys.stdout in a stream writer that encodes with the desired encoding (a trick covered in several related SO questions):

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

Now print will correctly output str and unicode objects, since the underlying stream writer will convert them to UTF-8 on the fly.
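To see what the wrapper does in isolation, you can point it at an in-memory byte buffer instead of sys.stdout (a sketch; io.BytesIO stands in for a byte-oriented stream like Python 2's stdout):

```python
import codecs
import io

buf = io.BytesIO()                       # stands in for a byte-oriented stdout
writer = codecs.getwriter('utf8')(buf)   # wraps it with on-the-fly UTF-8 encoding
writer.write(u'%5s\n' % u'\u03b1\u03b2\u03b3\u03b4')
# The unicode string was padded to 5 characters first, then encoded to bytes.
print(buf.getvalue())
```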

Manual string encoding before output

Of course, you can also manually encode each unicode string to a UTF-8 str prior to output:

print ('%5s' % s2).encode('utf8')

but that's tedious and error-prone.
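The order matters because encoding changes the length that %5s sees: the four Greek characters become eight UTF-8 bytes, so a byte string encoded first is already wider than 5 and gets no padding. A quick check:

```python
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print(len(s2))                 # 4 characters -> '%5s' pads with one space
print(len(s2.encode('utf8')))  # 8 bytes -> '%5s' would add no padding at all
```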

Explicit file open

For completeness: when opening files for writing with a specific encoding (like UTF-8) in Python 2, use either io.open or codecs.open, since they let you specify the encoding (unlike the built-in open):

from codecs import open
myfile = open('filename', 'w', encoding='utf-8')

or:

from io import open
myfile = open('filename', 'w', encoding='utf-8')
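Putting it together, a small round-trip sketch (the filename and the single-row table are just illustrative):

```python
import io

s2 = u'\u03b1\u03b2\u03b3\u03b4'

# Write the formatted unicode row out as UTF-8...
with io.open('table.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(u'%5s\n' % s2)

# ...and read it back, decoded transparently to unicode.
with io.open('table.txt', encoding='utf-8') as myfile:
    print(myfile.read())
```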
randomir

You should encode '%5s' % s2, not s2. The following then produces the expected output:

print ('%5s' % s2).encode('utf8')
JuniorCompressor

print '%5s' % s1 works, but print '%5s' % s2 does not when stdout is redirected. It must be print ('%5s' % s2).encode('utf8').

Try this code:

#!/usr/bin/env python

s1 = u'abcd'
s2 = u'\u03b1\u03b2\u03b3\u03b4'

print '1234567890' 
print '%5s' % s1
print ('%5s' % s2).encode('utf8')
sameera lakshitha