
I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

iconv -t utf-8 $file > converted/$file # this is shell code

Thanks!

Dzinx
Sébastien RoccaSerra

10 Answers


You can use the codecs module, like this:

import codecs
BLOCKSIZE = 1048576  # or some other desired chunk size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
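In Python 3 the same chunked copy also works with the built-in `open` and its `encoding` argument, no `codecs` needed. A minimal sketch, wrapped in a function with hypothetical names:

```python
def convert_to_utf8(src, dst, src_encoding, blocksize=1048576):
    # stream the file through in chunks so large files
    # never have to fit in memory all at once
    with open(src, "r", encoding=src_encoding) as source_file:
        with open(dst, "w", encoding="utf-8") as target_file:
            while True:
                contents = source_file.read(blocksize)
                if not contents:
                    break
                target_file.write(contents)
```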

K DawG
Dzinx
  • read() will always read the whole file - you probably want .read(BLOCKSIZE), where BLOCKSIZE is some suitable amount to read/write at once. – Brian Oct 10 '08 at 14:21
  • When in Python 3: Consider using `open` instead of `codecs.open` (see [here](https://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python)) – Rafael-WO Jun 08 '21 at 07:26
  • I run the code, into my test folder. I get this error: Traceback (most recent call last): File "D:\2022_12_02\TEST\convert txt to UTF-8 - versiune 2.py", line 3, in with codecs.open(sourceFileName, "r", "d:\\2022_12_02\\TEST") as sourceFile: NameError: name 'sourceFileName' is not defined – Just Me Mar 05 '23 at 10:38

This worked for me in a small test (Python 2):

sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
source = open("source")
target = open("target", "w")

target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
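Note that `unicode` only exists in Python 2. A rough Python 3 equivalent (the sample file is created here just so the snippet runs on its own; names and content are placeholders):

```python
# create a small Latin-1 sample file to convert (placeholder content)
with open("source", "wb") as f:
    f.write("déjà vu".encode("iso-8859-1"))

sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"

# in Python 3, open() decodes/encodes directly; no unicode() call needed
with open("source", "r", encoding=sourceEncoding) as source:
    with open("target", "w", encoding=targetEncoding) as target:
        target.write(source.read())
```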
Staale
  • Even better would be to specify binary mode. – Arafangion Apr 12 '11 at 00:10
  • @Arafangion Why binary mode would be better? Thanks! – Honghe.Wu Feb 20 '14 at 14:39
  • @Honghe.Wu: On windows, text mode is the default, and that means that your line endings will be mangled by the operating system, something you don't want if you're unsure about the encoding on disk. – Arafangion Apr 30 '14 at 02:59
  • @Arafangion How would the example look like, if I like to specify binary mode? `target = open("target", "wb")` are there some more changes? – The Bndr Mar 23 '15 at 16:32

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

from __future__ import with_statement

import os
import sys
import codecs
from chardet.universaldetector import UniversalDetector

targetFormat = 'utf-8'
outputDir = 'converted'
detector = UniversalDetector()

def get_encoding_type(current_file):
    detector.reset()
    for line in open(current_file, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    return detector.result['encoding']

def convertFileBestGuess(fileName):
    sourceFormats = ['ascii', 'iso-8859-1']
    for format in sourceFormats:
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

    print("Error: failed to convert '" + fileName + "'.")

def convertFileWithDetection(fileName):
    print("Converting '" + fileName + "'...")
    format = get_encoding_type(fileName)
    try:
        with codecs.open(fileName, 'rU', format) as sourceFile:
            writeConversion(sourceFile, fileName)
            print('Done.')
            return
    except UnicodeDecodeError:
        pass

    print("Error: failed to convert '" + fileName + "'.")


def writeConversion(sourceFile, fileName):
    with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
        for line in sourceFile:
            targetFile.write(line)

# Off topic: get the file list and call convertFile on each file
# ...

(EDIT by Rudro Badhon: this incorporates the original try multiple formats until you don't get an exception as well as an alternate approach that uses chardet.universaldetector)
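The "off topic" file-list step could be sketched like this (a hypothetical helper; each path it returns would then be handed to `convertFileWithDetection`):

```python
import os

def get_file_list(rootDir, suffixes=('.txt',)):
    # walk rootDir recursively and collect the files to convert
    result = []
    for dirPath, dirNames, fileNames in os.walk(rootDir):
        for name in fileNames:
            if name.endswith(suffixes):
                result.append(os.path.join(dirPath, name))
    return result
```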

Foon
Sébastien RoccaSerra

Answer for an unknown source encoding type

Based on @Sébastien RoccaSerra's answer. Python 3.6.

import os    
from chardet import detect

# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

from_codec = get_encoding_type(srcfile)

# add try: except block for reliability
try: 
    with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
        text = f.read() # for small files, for big use chunks
        e.write(text)

    os.remove(srcfile) # remove old encoding file
    os.rename(trgfile, srcfile) # rename new encoding
except UnicodeDecodeError:
    print('Decode Error')
except UnicodeEncodeError:
    print('Encode Error')
Sole Sensei

You can use this one-liner (assuming you want to convert from UTF-16 to UTF-8):

    python -c "from pathlib import Path; path = Path('yourfile.txt') ; path.write_text(path.read_text(encoding='utf16'), encoding='utf8')"

Where yourfile.txt is a path to your $file.

For this to work you need Python 3.4 or newer (which you almost certainly have nowadays).

Below is a more readable version of the code above:

from pathlib import Path
path = Path("yourfile.txt")
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")
Cesc
  • Depending on your operating system this may change the line break control characters. Great answer nevertheless, thank you. It needs more upvotes. Simple as that and no need to care about managing resources according to the documentation of Path.write_text: `Open the file in text mode, write to it, and close the file.` – david Jun 10 '21 at 00:08

This is a Python 3 function for converting any text file into one with UTF-8 encoding, without using unnecessary packages:

def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                fw.write(line[:-1]+'\r\n')

You can use it easily in a loop to convert a list of files.
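For example (the function is repeated so the snippet runs on its own; the file names and source encoding are placeholders):

```python
def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
    # same function as above
    with open(filename, 'r', encoding=encoding_from) as fr:
        with open(newFilename, 'w', encoding=encoding_to) as fw:
            for line in fr:
                fw.write(line[:-1] + '\r\n')

# create two small Latin-1 sample files standing in for a real list
for name, text in [('episode1.srt', 'hola\nseñor\n'), ('episode2.srt', 'ok\n')]:
    with open(name, 'w', encoding='iso-8859-1') as f:
        f.write(text)

# convert every file in the list to UTF-8 with CRLF line endings
for name in ('episode1.srt', 'episode2.srt'):
    correctSubtitleEncoding(name, name + '.utf8', 'iso-8859-1')
```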

MojiProg
  • this worked great for converting from is0-8859-1 to utf-8! – beep_check Apr 27 '20 at 14:04
  • Instead of `line[:-1]` it would be better to use `line.rstrip('\r\n')`. This way, no matter what line ending you encounter, you will get correct results. – fskoras Feb 22 '22 at 11:33

To guess the source encoding you can use the *nix `file` command.

Example:

$ file --mime jumper.xml

jumper.xml: application/xml; charset=utf-8
Ricardo

Convert all files in a directory to UTF-8 encoding. It is recursive and can filter files by suffix. Thanks @Sole Sensei.

# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple chardet
import os
import re
from chardet import detect


def get_file_list(d):
    result = []
    for root, dirs, files in os.walk(d):
        dirs[:] = [d for d in dirs if d not in ['venv', 'cmake-build-debug']]
        for filename in files:
            # your filter
            if re.search(r'(\.c|\.cpp|\.h|\.txt)$', filename):
                result.append(os.path.join(root, filename))
    return result


# get file encoding type
def get_encoding_type(file):
    with open(file, 'rb') as f:
        raw_data = f.read()
    return detect(raw_data)['encoding']


if __name__ == "__main__":
    file_list = get_file_list('.')
    for src_file in file_list:
        print(src_file)
        trg_file = src_file + '.swp'
        from_codec = get_encoding_type(src_file)
        try:
            with open(src_file, 'r', encoding=from_codec) as f, open(trg_file, 'w', encoding='utf-8') as e:
                text = f.read()
                e.write(text)
            os.remove(src_file)
            os.rename(trg_file, src_file)
        except UnicodeDecodeError:
            print('Decode Error')
        except UnicodeEncodeError:
            print('Encode Error')
jamlee

This is my brute force method. It also takes care of mingled \n and \r\n in the input. (The snippet is an excerpt from a larger class, so `cfg`, `self`, `filelocation`, `outputfilelocation`, `delimitervalue` and `quotevalue`, plus `import csv`, are defined elsewhere.)

    try:
        # open the CSV file
        inputfile = open(filelocation, 'rb')
        outputfile = open(outputfilelocation, 'w', encoding='utf-8')
        for line in inputfile:
            if line[-2:] == b'\r\n' or line[-2:] == b'\n\r':
                output = line[:-2].decode('utf-8', 'replace') + '\n'
            elif line[-1:] == b'\r' or line[-1:] == b'\n':
                output = line[:-1].decode('utf-8', 'replace') + '\n'
            else:
                output = line.decode('utf-8', 'replace') + '\n'
            outputfile.write(output)
        outputfile.close()
    except BaseException as error:
        cfg.log(self.outf, "Error(18): opening CSV-file " + filelocation + " failed: " + str(error))
        self.loadedwitherrors = 1
        return ([])
    try:
        # open the CSV-file of this source table
        csvreader = csv.reader(open(outputfilelocation, "rU"), delimiter=delimitervalue, quoting=quotevalue, dialect=csv.excel_tab)
    except BaseException as error:
        cfg.log(self.outf, "Error(19): reading CSV-file " + filelocation + " failed: " + str(error))

import codecs
import glob

import chardet

ALL_FILES = glob.glob('*.txt')

def kira_encoding_function():
    """Check encoding and convert to UTF-8, if encoding no UTF-8."""
    for filename in ALL_FILES:

        # Not 100% accuracy:
        # https://stackoverflow.com/a/436299/5951529
        # Check:
        # https://chardet.readthedocs.io/en/latest/usage.html#example-using-the-detect-function
        # https://stackoverflow.com/a/37531241/5951529
        with open(filename, 'rb') as opened_file:
            bytes_file = opened_file.read()
            chardet_data = chardet.detect(bytes_file)
            fileencoding = (chardet_data['encoding'])
            print('fileencoding', fileencoding)

            if fileencoding in ['utf-8', 'ascii']:
                print(filename + ' in UTF-8 encoding')
            else:
                # Convert file to UTF-8:
                # https://stackoverflow.com/q/19932116/5951529
                cyrillic_file = bytes_file.decode('cp1251')
                with codecs.open(filename, 'w', 'utf-8') as converted_file:
                    converted_file.write(cyrillic_file)
                print(filename +
                      ' in ' +
                      fileencoding +
                      ' encoding automatically converted to UTF-8')


kira_encoding_function()


Just Me