
I'm trying to parse (and escape) rows of a CSV file stored in the Windows-1251 character encoding. Using this excellent answer to deal with the encoding, I've ended up with a one-line test of the output. For some reason this works:

print(row[0]+','+row[1])

Outputting:

Тяжелый Уборщик Обязанности,1 литр

While this line doesn't work:

print("{0},{1}".format(*row))

Outputting this error:

Name,Variant

Traceback (most recent call last):
  File "Russian.py", line 26, in <module>
    print("{0},{1}".format(*row))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Here are the first 2 lines of the CSV:

Name,Variant
Тяжелый Уборщик Обязанности,1 литр

and in case it helps, here is the full source of Russian.py:

import csv
import cgi
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

with open('Russian.csv') as csv_file:
    cd_result = charset_detect(csv_file)
    encoding = cd_result['encoding']
    csv_file.seek(0)
    csv_reader = csv.reader(csv_file)
    for bytes_row in csv_reader:
        row = [x.decode(encoding) for x in bytes_row]
        if len(row) >= 6:
            #print(row[0]+','+row[1])
            print("{0},{1}".format(*row))
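For reference, under Python 3 this whole class of error disappears, because open() can decode the file for you and every str is already Unicode. A minimal sketch, assuming the detected encoding really is Windows-1251 (the in-memory bytes below stand in for Russian.csv):

```python
import csv
import io

# Stand-in for the file contents; in a real script you would simply use
# open('Russian.csv', encoding='windows-1251', newline='') instead.
data = "Name,Variant\nТяжелый Уборщик Обязанности,1 литр\n".encode('windows-1251')

with io.TextIOWrapper(io.BytesIO(data), encoding='windows-1251', newline='') as f:
    for row in csv.reader(f):
        # Rows arrive as decoded str values; no per-cell decode step is needed.
        print("{0},{1}".format(*row))
```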
Jason Sperske

3 Answers


The strings in your list were already unicode (you decoded them yourself), so you didn't get an issue.

print(row[0]+','+row[1])
Тяжелый Уборщик Обязанности,1 литр

But here we are trying to interpolate unicode values into a normal (byte) string template! That's why you get the UnicodeEncodeError.

print("{0},{1}".format(*row))

So just change it to:

print(u"{0}, {1}".format(*row))
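As a quick sanity check of the fix: in Python 3 the u prefix still parses but is redundant, since every string literal is already Unicode. A small sketch, assuming a hard-coded row in place of the CSV data:

```python
# Python 3 sketch (assumption: a hard-coded row standing in for the CSV data).
row = ["Тяжелый Уборщик Обязанности", "1 литр"]

# The u prefix is accepted in Python 3 but changes nothing:
print(u"{0}, {1}".format(*row))  # identical to "{0}, {1}".format(*row)
```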
Zizouz212
  • This doesn't really explain why the first version works. It's still adding `unicode` to `str`, then adding the result to `unicode`. [Martijn Pieters' answer](http://stackoverflow.com/a/30469184/908494) explains why that's OK (because in this case it ends up decoding the `','`). – abarnert May 26 '15 at 21:35

You are using str.format(), which implicitly converts unicode() arguments to str(). It has to do so to be able to interpolate the values into the str template you provided.

Use unicode.format() instead:

print(u"{0},{1}".format(*row))

Note the u before the format literal. unicode.format() has to decode str inputs to fit in the resulting Unicode output.

Concatenation, on the other hand, can implicitly decode to produce a final unicode() result. Had your ',' literal contained non-ASCII bytes, that implicit decoding would also have failed.
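The same mechanism can be spelled out in Python 3, which refuses to mix str and bytes and so forces the decode into the open (a sketch; the windows-1251 bytes stand in for the question's data):

```python
prefix = "Тяжелый"                  # text (Unicode)
sep = b","                          # ASCII-only bytes: decodes cleanly
# This explicit decode is what Python 2 performed implicitly on concatenation:
print(prefix + sep.decode('ascii'))

cyr = "литр".encode('windows-1251') # non-ASCII bytes
try:
    prefix + cyr.decode('ascii')    # the implicit-style ASCII decode fails here
except UnicodeDecodeError as exc:
    print("decode failed:", exc)
```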

Moral of the story: use Unicode string literals throughout your code when handling text.

Martijn Pieters

The + operator works fine between a unicode string and an str string. str.format, on the other hand, doesn't accept unicode strings as parameters.

Thus, you can simply replace the problematic line with the following:

print(u"{0},{1}".format(*row))

That should do the trick.

Hetzroni
  • This is wrong. `str.format` _does_ accept `unicode` strings as parameters; it just `encode`s them. And `+` only "works fine" between `unicode` and `str` in the same sense—because it `encode`s one or `decode`s the other. – abarnert May 26 '15 at 21:36