
I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.

#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)

    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple and so on.... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()

How can I do that?

Thank you

Chilledrat
Alex

3 Answers

import codecs

# filename is the cp1251 input path, output is the UTF-8 output path
with codecs.open(filename, 'r', 'cp1251') as f:
    u = f.read()   # now the contents have been transformed to a Unicode string
with codecs.open(output, 'w', 'utf-8') as out:
    out.write(u)   # and now the contents have been output as UTF-8

Is this what you intend to do?
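
For anyone on a newer Python, roughly the same read-then-write idea can be sketched with io.open (which is the built-in open in Python 3). This is only a sketch: the function name convert_cp1251_to_utf8 and its two path arguments are made up for illustration, and newline='' is my own addition so that line endings pass through unchanged.

```python
import io


def convert_cp1251_to_utf8(src_path, dst_path):
    """Read src_path as Windows-1251 and write the same text to dst_path as UTF-8."""
    # Decoding happens on read: bytes on disk become a Unicode string.
    with io.open(src_path, 'r', encoding='cp1251', newline='') as src:
        text = src.read()
    # Encoding happens on write: the Unicode string becomes UTF-8 bytes.
    with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
```

Writing to a separate output path leaves the original file untouched, so no .bak copy is needed in this variant.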

buruzaemon
  • I think you are very close! I managed to read the data from XML, but when I write it to the file, I get weird characters instead of Cyrillic ones. – Alex Apr 27 '11 at 16:26
  • YES! I got it! I was using cp1252 instead. Thank you so much – Alex Apr 27 '11 at 16:30
  • @Alex, glad to know you got your code working. You may want to have a look at http://www.evanjones.ca/python-utf8.html, some good tips there. – buruzaemon Apr 27 '11 at 16:38

If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:

import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)

This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
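
To make that distinction concrete, here is a small sketch. The sample bytes are my own example (the word "Привет" as it would be stored in Windows-1251), not data from the question. Decoding gives a Unicode string in memory; getting bytes for a file means picking a concrete encoding such as UTF-8 or UTF-16.

```python
# Six bytes: the word "Привет" as stored in a Windows-1251 file (illustrative sample).
raw = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# Decoding turns bytes into a Unicode string that lives only in memory.
text = raw.decode('cp1251')

# Writing to disk requires choosing an actual encoding; each choice
# produces a different byte sequence for the same text.
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')
```

The same text ends up as different bytes depending on the codec, which is why "Unicode" alone is not enough to name an output format.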

Will McCutchen
  • I still get an error UnicodeEncodeError: 'charmap' codec can't encode characters in position: character maps to <undefined> – Alex Apr 27 '11 at 16:04
  • What is the actual code you're using? What line triggers this exception? – Will McCutchen Apr 27 '11 at 16:08
  • @Will McCutchen - you should use `'rb'` as the mode. Overriding it to `'r'` is almost never what you want to do. – D.Shawley Apr 27 '11 at 16:18
  • @Alex: are you sure that it is encoded as CP1251 and not ISO-8859-5 or some other code page? Try using `encodings.cp1251.StreamReader` to read the input. – D.Shawley Apr 27 '11 at 16:18
  • @D.Shawley, the codecs module always opens files in binary mode. – Will McCutchen Apr 27 '11 at 16:27
  • @Will McCutchen - I wasn't sure if that was the case or not. Despite what the documentation states, it will only force to binary mode when an encoding is specified. In this case, it will switch to binary mode regardless of what you specify. If you drop the `'cp1251'` argument, then it will open in text mode. Just out of curiosity, why not use `'rb'` explicitly or use `codecs.open('input.txt', encoding='cp1251')` and let the default _do its thing_? – D.Shawley Apr 28 '11 at 22:53
  • @D.Shawley, you're right, `codecs.open('input.txt', encoding='cp1251')` is the Right Way to do it... Updating my answer for posterity. – Will McCutchen Apr 29 '11 at 15:20

This is just a guess, since you didn't specify what you mean by "doesn't work".

If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
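
If you want Python to add those three bytes for you, one option is the 'utf-8-sig' codec, which writes the BOM at the start of the output. A minimal sketch, assuming io.open (the built-in open in Python 3); the helper name write_utf8_with_bom is hypothetical:

```python
import io


def write_utf8_with_bom(path, text):
    """Write a Unicode string to path as UTF-8, preceded by the BOM (EF BB BF)."""
    # 'utf-8-sig' emits the byte order mark once, before the first character.
    with io.open(path, 'w', encoding='utf-8-sig', newline='') as out:
        out.write(text)
```

Whether the BOM is needed depends on the viewer; some tools expect a plain UTF-8 file without it, so treat this as something to try rather than a default.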

Mark Ransom