
I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

iconv -t utf-8 $file > converted/$file # this is shell code

Thanks!

Dzinx
Sébastien RoccaSerra

10 Answers


You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576  # or some other desired chunk size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
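In Python 3 the same chunked copy also works with the built-in `open` and its `encoding` argument, no `codecs` needed. A minimal sketch, wrapped in a function with hypothetical names:

```python
def convert_to_utf8(src, dst, src_encoding, blocksize=1048576):
    # stream the file through in chunks so large files
    # never have to fit in memory all at once
    with open(src, "r", encoding=src_encoding) as source_file:
        with open(dst, "w", encoding="utf-8") as target_file:
            while True:
                contents = source_file.read(blocksize)
                if not contents:
                    break
                target_file.write(contents)
```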

K DawG
Dzinx
  • read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once. – Brian Oct 10 '08 at 14:21
  • When in Python 3: Consider using `open` instead of `codecs.open` (see [here](https://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python)) – Rafael-WO Jun 08 '21 at 07:26
  • I run the code, into my test folder. I get this error: Traceback (most recent call last): File "D:\2022_12_02\TEST\convert txt to UTF-8 - versiune 2.py", line 3, in with codecs.open(sourceFileName, "r", "d:\\2022_12_02\\TEST") as sourceFile: NameError: name 'sourceFileName' is not defined – Just Me Mar 05 '23 at 10:38

This worked for me in a small test (Python 2):

sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
source = open("source")
target = open("target", "w")

target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
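Note that `unicode` only exists in Python 2. A rough Python 3 equivalent (the sample file is created here just so the snippet runs on its own; names and content are placeholders):

```python
# create a small Latin-1 sample file to convert (placeholder content)
with open("source", "wb") as f:
    f.write("déjà vu".encode("iso-8859-1"))

sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"

# in Python 3, open() decodes/encodes directly; no unicode() call needed
with open("source", "r", encoding=sourceEncoding) as source:
    with open("target", "w", encoding=targetEncoding) as target:
        target.write(source.read())
```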
Staale
  • Even better would be to specify binary mode. – Arafangion Apr 12 '11 at 00:10
  • @Arafangion Why binary mode would be better? Thanks! – Honghe.Wu Feb 20 '14 at 14:39
  • @Honghe.Wu: On windows, text mode is the default, and that means that your line endings will be mangled by the operating system, something you don't want if you're unsure about the encoding on disk. – Arafangion Apr 30 '14 at 02:59
  • @Arafangion How would the example look like, if I like to specify binary mode? `target = open("target", "wb")` are there some more changes? – The Bndr Mar 23 '15 at 16:32

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

from __future__ import with_statement

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    for line in open(current_file, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    sourceFormats = ['ascii', 'iso-8859-1']
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

    print("Error: failed to convert '" + fileName + "'.")

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    format = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'rU', format) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass

    print("Error: failed to convert '" + fileName + "'.")


def writeConversion(sourceFile, fileName):
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

(EDIT by Rudro Badhon: this incorporates the original try multiple formats until you don't get an exception as well as an alternate approach that uses chardet.universaldetector)
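The "off topic" file-list step could be sketched like this (a hypothetical helper; each path it returns would then be handed to `convertFileWithDetection`):

```python
import os

def get_file_list(rootDir, suffixes=('.txt',)):
    # walk rootDir recursively and collect the files to convert
    result = []
    for dirPath, dirNames, fileNames in os.walk(rootDir):
        for name in fileNames:
            if name.endswith(suffixes):
                result.append(os.path.join(dirPath, name))
    return result
```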

Foon
Sébastien RoccaSerra

Answer for an unknown source encoding type

Based on @Sébastien RoccaSerra's answer. Python 3.6.

import os    
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(srcfile)

# add try: except block for reliability
try: 
    with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
        text = f.read() # for small files, for big use chunks
        e.write(text)

    os.remove(srcfile) # remove old encoding file
    os.rename(trgfile, srcfile) # rename new encoding
except UnicodeDecodeError:
    print('Decode Error')
except UnicodeEncodeError:
    print('Encode Error')
Sole Sensei

You can use this one-liner (assuming you want to convert from UTF-16 to UTF-8):

    python -c "from pathlib import Path; path = Path('yourfile.txt') ; path.write_text(path.read_text(encoding='utf16'), encoding='utf8')"

Where yourfile.txt is a path to your $file.

For this to work you need Python 3.4 or newer (which you almost certainly have nowadays).

Below is a more readable version of the code above:

from pathlib import Path
path = Path("yourfile.txt")
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
Cesc
  • Depending on your operating system this may change the line break control characters. Great answer nevertheless, thank you. It needs more upvotes. Simple as that and no need to care about managing resources according to the documentation of Path.write_text: `Open the file in text mode, write to it, and close the file.` – david Jun 10 '21 at 00:08

This is a Python 3 function for converting any text file into one with UTF-8 encoding, without using unnecessary packages:

def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                fw.write(line[:-1]+'\r\n')

You can use it easily in a loop to convert a list of files.
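For example (the function is repeated so the snippet runs on its own; the file names and source encoding are placeholders):

```python
def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    # same function as above
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                fw.write(line[:-1] + '\r\n')

# create two small Latin-1 sample files standing in for a real list
for name, text in [('episode1.srt', 'hola\nseñor\n'), ('episode2.srt', 'ok\n')]:
    with open(name, 'w', encoding='iso-8859-1') as f:
        f.write(text)

# convert every file in the list to UTF-8 with CRLF line endings
for name in ('episode1.srt', 'episode2.srt'):
    correctSubtitleEncoding(name, name + '.utf8', 'iso-8859-1')
```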

MojiProg
  • this worked great for converting from is0-8859-1 to utf-8! – beep_check Apr 27 '20 at 14:04
  • Instead of `line[:-1]` it would be better to use `line.rstrip('\r\n')`. This way, no matter what line ending you encounter, you will get correct results. – fskoras Feb 22 '22 at 11:33

To guess the source encoding you can use the *nix `file` command.

Example:

$ file --mime jumper.xml

jumper.xml: application/xml; charset=utf-8
Ricardo

Convert all files in a directory to UTF-8 encoding. It is recursive and can filter files by suffix. Thanks @Sole Sensei.

# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple chardet
import os
import re
from chardet import detect


def get_file_list(d):
    result = []
    for root, dirs, files in os.walk(d):
        dirs[:] = [d for d in dirs if d not in ['venv', 'cmake-build-debug']]
        for filename in files:
            # your filter
            if re.search(r'(\.c|\.cpp|\.h|\.txt)$', filename):
                result.append(os.path.join(root, filename))
    return result


# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        raw_data = f.read()
    return detect(raw_data)['encoding']


if __name__ == "__main__":
    file_list = get_file_list('.')
    for src_file in file_list:
        print(src_file)
        trg_file = src_file + '.swp'
        from_codec = get_encoding_type(src_file)
        try:
            with open(src_file, 'r', encoding=from_codec) as f, open(trg_file, 'w', encoding='utf-8') as e:
                text = f.read()
                e.write(text)
            os.remove(src_file)
            os.rename(trg_file, src_file)
        except UnicodeDecodeError:
            print('Decode Error')
        except UnicodeEncodeError:
            print('Encode Error')
jamlee

This is my brute force method. It also takes care of mingled \n and \r\n in the input. (The snippet is an excerpt from a larger class, so `cfg`, `self`, `filelocation`, `outputfilelocation`, `delimitervalue` and `quotevalue`, plus `import csv`, are defined elsewhere.)

    try:
        # open the CSV file
        inputfile = open(filelocation, 'rb')
        outputfile = open(outputfilelocation, 'w', encoding='utf-8')
        for line in inputfile:
            if line[-2:] == b'\r\n' or line[-2:] == b'\n\r':
                output = line[:-2].decode('utf-8', 'replace') + '\n'
            elif line[-1:] == b'\r' or line[-1:] == b'\n':
                output = line[:-1].decode('utf-8', 'replace') + '\n'
            else:
                output = line.decode('utf-8', 'replace') + '\n'
            outputfile.write(output)
        outputfile.close()
    except BaseException as error:
        cfg.log(self.outf, "Error(18): opening CSV-file " + filelocation + " failed: " + str(error))
        self.loadedwitherrors = 1
        return ([])
    try:
        # open the CSV-file of this source table
        csvreader = csv.reader(open(outputfilelocation, "rU"), delimiter=delimitervalue, quoting=quotevalue, dialect=csv.excel_tab)
    except BaseException as error:
        cfg.log(self.outf, "Error(19): reading CSV-file " + filelocation + " failed: " + str(error))

import codecs
import glob

import chardet

ALL_FILES = glob.glob('*.txt')

def kira_encoding_function():
    """Check encoding and convert to UTF-8, if encoding no UTF-8."""
    for filename in ALL_FILES:

        # Not 100% accuracy:
        # https://stackoverflow.com/a/436299/5951529
        # Check:
        # https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function
        # https://stackoverflow.com/a/37531241/5951529
        with open(filename, 'rb') as opened_file:
            bytes_file = opened_file.read()
            chardet_data = chardet.detect(bytes_file)
            fileencoding = (chardet_data['encoding'])
            print('fileencoding', fileencoding)

            if fileencoding in ['utf-8', 'ascii']:
                print(filename + ' in UTF-8 encoding')
            else:
                # Convert file to UTF-8:
                # https://stackoverflow.com/q/19932116/5951529
                cyrillic_file = bytes_file.decode('cp1251')
                with codecs.open(filename, 'w', 'utf-8') as converted_file:
                    converted_file.write(cyrillic_file)
                print(filename +
                      ' in ' +
                      fileencoding +
                      ' encoding automatically converted to UTF-8')


kira_encoding_function()


Just Me