Convert from ANSI to UTF-8

Question

I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?

I found this link, but the Python script does not run: notepad++ converting ansi encoded file to utf-8

Please note that "lakh" is not a word from standard (US/UK == international) English. Many people outside your corner of the world don't know what a "lakh" is. — Roland, Jul 17 '15 at 08:15

score 9 · Answer 1 · answered Jul 17 '15 at 08:13

Why don't you read the file and write it as UTF-8? You can do that in Python.

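# codecs.open lets us name an encoding when opening a file
import codecs

# read the input file (the encoding given here must match the file's actual encoding)
with codecs.open(path, 'r', encoding='utf8') as file:
    lines = file.read()

# write the output file as UTF-8
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)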

answered Jul 17 '15 at 08:13 · 3Ducker

Won't reading a different codepage as UTF-8 lose some characters? (I had thought you have to read with the correct codepage before writing in another codepage.) – Tensibai Jul 17 '15 at 08:16

From Python specs: _Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing._ – 3Ducker Jul 17 '15 at 08:23
I had to change the read encoding to 'cp1252' to get it to work for me. It still opens with UTF-8 otherwise which gave me an error when encountering a mixed file: `'utf-8' codec can't decode byte 0x92` – ConductedForce Feb 26 '21 at 13:44
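
Following the comments above, here is a minimal sketch of a batch conversion that decodes with the source code page rather than UTF-8. It assumes that "ANSI" means Windows-1252 (`cp1252`); that is a guess about the files, not something the question confirms. The names `to_utf8`, `convert_tree` and `fallback` are made up for illustration. Files that already decode as UTF-8 (including plain ASCII) are left alone, which is only a heuristic for the mixed case mentioned above.

```python
import os


def to_utf8(data: bytes, fallback: str = "cp1252") -> bytes:
    """Return data re-encoded as UTF-8.

    Bytes that are already valid UTF-8 are returned unchanged. Anything else
    is assumed to be in the `fallback` code page (an assumption, adjust it).
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # May itself raise UnicodeDecodeError if a byte is undefined in `fallback`.
        return data.decode(fallback).encode("utf-8")


def convert_tree(root: str, fallback: str = "cp1252") -> int:
    """Convert every file under root in place; return how many files changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Each file is read fully into memory, which suits small text files.
            with open(path, "rb") as f:
                original = f.read()
            converted = to_utf8(original, fallback)
            if converted != original:
                with open(path, "wb") as f:
                    f.write(converted)
                changed += 1
    return changed
```

Calling `convert_tree("path/to/files")` (a placeholder path) rewrites files in place, so it is probably wise to run it on a copy of the data first.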

score 4 · Answer 2 · answered Dec 19 '18 at 17:27

I appreciate that this is an old question, but having just resolved a similar problem, I thought I would share my solution.

I had a file prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

The ANSI encoding is recognised as 'mbcs' in Python (a codec that exists only on Windows and maps to the system's ANSI code page), so the code I used, adapted from something else I found, is:

import codecs

# copy in blocks of roughly 1 MiB so large files are not loaded into memory at once
blockSize = 1048576
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                break
            targetFile.write(contents)

The link below has some information on the encoding types, which I found during my research:

https://docs.python.org/2.4/lib/standard-encodings.html
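
For Python 3, the same block-wise copy can be sketched with the built-in `open()`. This is an illustration under assumptions, not the code from the answer: it names the code page explicitly (`cp1252`, a guess at what "ANSI" means for these files) because `mbcs` is only available on Windows, and the function name `convert_file` and its parameters are hypothetical.

```python
def convert_file(src_path, dst_path, src_encoding="cp1252", block_size=1024 * 1024):
    """Stream src_path (decoded as src_encoding) into dst_path as UTF-8."""
    # newline="" turns off newline translation, so existing line endings are kept.
    with open(src_path, "r", encoding=src_encoding, newline="") as source, \
         open(dst_path, "w", encoding="utf-8", newline="") as target:
        while True:
            # In text mode, read(n) returns at most n characters, "" at end of file.
            contents = source.read(block_size)
            if not contents:
                break
            target.write(contents)
```

Writing to a separate output path, as the answer does, keeps the original file intact in case the assumed source encoding turns out to be wrong.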
