How to encode string into utf-8 for both Python 2.x and 3.x

Question

I'm trying to write a formatted string containing Cyrillic symbols (in UTF-8) to a Unix pipe:

sort_proc.stdin.write("{}\n".format(cyrillic_text).decode('utf-8').encode('utf-8'))

I had to encode because of the error "'str' does not support the buffer interface", and to decode because of "'ascii' codec can't decode byte 0xd0". This code works as expected in Python 2.7. However, Python 3.4 says "'str' object has no attribute 'decode'", because string literals in Python 3 are already "decoded". So I know how to fix it for each version separately, but not for both at once. I've found a solution that involves reloading the sys module and calling sys.setdefaultencoding; however, the article "Why should we NOT use sys.setdefaultencoding" says it's just a hack and shouldn't be used at all. Please post the most Pythonic way of doing this. Thanks.

You can execute different code depending on the version: `if sys.version_info[0]==2: .... else: ....` — DYZ, May 03 '17 at 20:02
What does `.decode('utf-8').encode('utf-8')` achieve? Doesn't look very meaningful. — Stefan Pochmann, May 03 '17 at 20:08
@StefanPochmann I think encode without decode uses default ascii codec, that's why it can't decode `0xd0`. In general it works without encode in python2 but doesn't work in python3 — Alex, May 03 '17 at 20:32
@Alex Sorry, I don't understand. What is `"{}\n".format(cyrillic_text)`? Can you give a small example? I.e., some `s` so that `s.decode('utf-8').encode('utf-8')` isn't the same as `s`? — Stefan Pochmann, May 03 '17 at 20:58
Where does `cyrillic_text` come from? Is a string literal? User input? Text read from a file? — dan04, May 03 '17 at 21:10
I managed to get Unicode data from the file using `codecs.open("path", "r", encoding='utf-8')`, but it fails down the road in third-party modules on a `b'utf string'.split('python3 unicode separator')` line, so I have to use @DYZ's solution for now — Alex, May 04 '17 at 09:03

score 1 · Accepted Answer · answered May 03 '17 at 20:13

Use unicode strings (instead of the 8-bit str) throughout your Python 2.x code. The 2.x unicode type is equivalent to the Python 3.x str type. Then you can simply call the_string.encode('UTF-8') to get a byte string (of type str in 2.x, but bytes in 3.x).

If you don't need to support Python 3.0 through 3.2, you can prefix all your string literals with u. In Python 2.x, this creates a unicode string, and in 3.3+ it's supported for backwards compatibility but does nothing.
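
A minimal sketch of this approach, assuming Python 2.7 or 3.3+. The helper name `to_utf8_line` is hypothetical, and `io.BytesIO` only stands in for a binary pipe such as `sort_proc.stdin` from the question. The `isinstance` check is an extra safeguard (not part of the advice above) for text that still arrives as UTF-8 bytes:

```python
# -*- coding: utf-8 -*-
import io


def to_utf8_line(text):
    """Return text plus a newline as UTF-8 bytes (hypothetical helper)."""
    # On Python 2, `bytes` is an alias of `str`, so this branch catches
    # 8-bit strings there and real bytes objects on Python 3.
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # The u prefix makes the template a unicode string on 2.x and is a
    # no-op on 3.3+, so formatting stays in "text" until the final encode.
    return u"{}\n".format(text).encode('utf-8')


# io.BytesIO accepts only bytes, like a pipe opened in binary mode.
pipe = io.BytesIO()
pipe.write(to_utf8_line(u"привет"))
```

With a real subprocess, the same call would be `sort_proc.stdin.write(to_utf8_line(cyrillic_text))`; the single encode happens at the boundary where text leaves the program.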
