8


I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?

I found this question, but the Python script in it does not run for me: "notepad++ converting ansi encoded file to utf-8"

Karan Pappala

2 Answers

9

Why don't you read the file and write it as UTF-8? You can do that in Python.

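# codecs.open decodes and encodes text using the given encoding
import codecs

# read the input file, decoding it from its ANSI code page
# (cp1252 is the usual Windows "ANSI" code page for Western European
# locales; use the code page the files were actually saved in)
with codecs.open(path, 'r', encoding='cp1252') as file:
    lines = file.read()

# write the same text back to the same path, encoded as UTF-8
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)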
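
For the 600,000 files in the question, the same read-then-write step could be looped over a directory tree. What follows is only a minimal sketch, not part of the original answer: it assumes the files are Windows-1252 (`cp1252`) text, and `convert_tree`, `source_encoding` and `pattern` are hypothetical names chosen for illustration. It works on raw bytes so that line endings are left untouched, and it skips any file that does not decode instead of stopping the whole run.

```python
from pathlib import Path


def convert_tree(root, source_encoding="cp1252", pattern="*.txt"):
    """Re-encode every matching file under root to UTF-8, in place.

    source_encoding is an assumption about the input files; change it
    to the code page they were really saved in.
    Returns two lists of paths: (converted, failed).
    """
    converted, failed = [], []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        try:
            # Decode the raw bytes with the assumed ANSI code page.
            text = path.read_bytes().decode(source_encoding)
        except UnicodeDecodeError:
            # Leave undecodable files untouched and report them.
            failed.append(path)
            continue
        # Overwrite the file with the same text encoded as UTF-8.
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted, failed
```

Because the files are overwritten in place, it is probably safest to try this on a copy first. Running it twice over the same tree would most likely decode the already converted UTF-8 bytes as `cp1252` again and corrupt any non-ASCII characters.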
3Ducker
  • Won't reading a file saved in a different code page as UTF-8 lose some characters? (I had thought you have to read with the correct code page before writing in another code page.) – Tensibai Jul 17 '15 at 08:16
  • From Python specs: _Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing._ – 3Ducker Jul 17 '15 at 08:23
  • I had to change the read encoding to 'cp1252' to get it to work for me. It still opens with UTF-8 otherwise which gave me an error when encountering a mixed file: `'utf-8' codec can't decode byte 0x92` – ConductedForce Feb 26 '21 at 13:44
4

I appreciate that this is an old question but having just resolved a similar problem recently I thought I would share my solution.

I had a file being prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

On Windows, the ANSI encoding is recognised as 'mbcs' in Python, so the code I used, adapted from something else I found, is:

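import codecs

# copy in 1 MiB blocks so a large file is never held in memory all at once
blockSize = 1048576
# 'mbcs' is the Windows ANSI code page codec (available on Windows only)
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:  # an empty string means end of file
                break
            targetFile.write(contents)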
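
Since 'mbcs' only exists on Windows, the same block-by-block copy can be wrapped in a function that takes the source encoding as a parameter. This is a rough sketch under stated assumptions and not the code I actually ran: `convert_file` and its parameters are hypothetical names, and falling back to `cp1252` on other platforms is only a guess at what "ANSI" means for a given set of files.

```python
import sys


def convert_file(src, dst, source_encoding=None, block_size=1048576):
    """Copy src to dst block by block, re-encoding the text as UTF-8."""
    if source_encoding is None:
        # Assumption: use the system ANSI code page on Windows and
        # fall back to Windows-1252 on other platforms.
        source_encoding = "mbcs" if sys.platform == "win32" else "cp1252"
    # newline="" switches off newline translation so that the original
    # line endings are copied through unchanged.
    with open(src, "r", encoding=source_encoding, newline="") as source_file, \
            open(dst, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(block_size)
            if not contents:  # an empty string means end of file
                break
            target_file.write(contents)
```

It could then be called as `convert_file("your ANSI source file.txt", "Your UTF-8 output file.txt", source_encoding="cp1252")` when the code page is known in advance.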

The link below contains some information on the encoding types that I found during my research:

https://docs.python.org/2.4/lib/standard-encodings.html

Strebormit