
I'm running Ubuntu 10.04 LTS with Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56):

>>> m = 'Šiven'
>>> m
'\xa6iven'
>>> unicode(m)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 0: ordinal not in range(128)

How should I set the encoding and decoding properly so that it writes out exactly what it reads in?

Kristian

2 Answers


In Python 2.x, a plain quoted literal denotes a string of bytes, not characters. You want a character (unicode) string, which is written with a u prefix in 2.x:

>>> m = u'Šiven'
>>> print(m)
Šiven
>>> m.encode('utf-8') # Get the corresponding UTF-8 bytestring
'\xc5\xa0iven'

Note that this only works if your terminal's encoding matches the encoding Python assumes for its input. You should really just set both to UTF-8.

If that's not the case, you should use unicode escapes:

>>> m = u'\u0160iven'
>>> print(m)
Šiven
>>> m.encode('utf-8')
'\xc5\xa0iven'

In a Python file (not a terminal), you can set the encoding according to PEP 263 by starting the file like this:

# -*- coding: utf-8 -*-
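
For example, a minimal sketch of a complete source file, assuming the file itself is saved as UTF-8 (it just reuses the string from the question):

# -*- coding: utf-8 -*-
# Because of the declaration above, the UTF-8 bytes of this literal
# are decoded correctly into a unicode string.
m = u'Šiven'
print m.encode('utf-8')   # write it back out as UTF-8 bytes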

You may also want to use Python 3.x, which clears up the confusion between byte and character strings.

phihag
  • m comes from something like this: M = file.readlines() ... for m in M: ... How can I state m = u'Šiven' in there? – Kristian Mar 03 '12 at 14:06
  • 1
  • You don't need the `readlines` in there, you can just iterate over the file (and that will halve your memory requirements). You should really consult [other](http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python) [questions](http://stackoverflow.com/questions/147741/character-reading-from-file-in-python), or, if these questions (and the ones you have searched for) and their top answers don't solve your problem, ask a new question yourself. In short, use [`codecs.open`](http://docs.python.org/library/codecs.html#codecs.open). – phihag Mar 03 '12 at 14:12
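
A minimal sketch of the codecs.open approach mentioned in the comment above (the filename input.txt is hypothetical; UTF-8 input is assumed):

import codecs

# Each line comes back as a unicode object, decoded from UTF-8 while reading.
f = codecs.open('input.txt', 'r', encoding='utf-8')
for m in f:                               # iterate directly; no readlines() needed
    print m.rstrip(u'\n').encode('utf-8')
f.close()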

You should probably put # -*- coding: utf-8 -*- at the top of your files and use UTF-8 in your editor and everything else anyway, to avoid these problems. But if you want to find out which encoding best fits your current input, you could try this script (replace 'some string' with something more localized):

encodings = ['ascii', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

def test(s):
    """Try to decode the byte string s with every listed codec."""
    for enc in encodings:
        try:
            u = unicode(s, enc)   # decode the bytes with this codec
            print u, enc          # show the decoded text and the codec name
        except (UnicodeError, LookupError):
            pass                  # this codec cannot decode (or print) the input

test('some string')

That being said, utf-8 is your friend; use it. :)
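
For the "write exactly what it reads" part of the question, a minimal sketch under the same UTF-8 assumption (output.txt is a hypothetical filename):

import codecs

# The writer encodes unicode objects to UTF-8 bytes on the way out.
out = codecs.open('output.txt', 'w', encoding='utf-8')
out.write(u'\u0160iven\n')   # stored on disk as the bytes '\xc5\xa0iven\n'
out.close()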

Frg