Ansi to UTF-8 using python causing error

Question

While trying to write a Python program that converts ANSI to UTF-8, I found this question:

https://stackoverflow.com/questions/14732996/how-can-i-convert-utf-8-to-ansi-in-python

which converts UTF-8 to ANSI.

I thought it would work if I simply reversed the order, so I wrote:

file_path_ansi = "input.txt"
file_path_utf8 = "output.txt"

# open the original file and decode its content
file_source = open(file_path_ansi, mode='r', encoding='latin-1', errors='ignore')
file_content = file_source.read()
file_source.close()

# write the content back out as UTF-8
file_target = open(file_path_utf8, mode='w', encoding='utf-8')
file_target.write(file_content)
file_target.close()

But it raises an error:

TypeError: file() takes at most 3 arguments (4 given)

So I changed

file_source = open(file_path_ansi, mode='r', encoding='latin-1', errors='ignore')

to

file_source = open(file_path_ansi, mode='r', encoding='latin-1')

Then it raises a different error:

TypeError: 'encoding' is an invalid keyword argument for this function

How should I fix my code to solve this problem?

Martijn Pieters · Accepted Answer · 2020-04-26T11:31:30.403

9

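You are trying to use the Python 3 version of the open() function on Python 2. Between the major versions, I/O support was overhauled, with much better support for encoding and decoding. The Python 2 built-in open() only accepts a file name, a mode and a buffering argument, which is why you see the errors about too many arguments and an invalid 'encoding' keyword.

The new version is available in Python 2 as io.open().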

I'd use the shutil.copyfileobj() function to do the copying, so you don't have to read the whole file into memory:

import io
import shutil

with io.open(file_path_ansi, encoding='latin-1', errors='ignore') as source:
    with io.open(file_path_utf8, mode='w', encoding='utf-8') as target:
        shutil.copyfileobj(source, target)

Be careful though; most people talking about ANSI refer to one of the Windows codepages; you may really have a file in CP (codepage) 1252, which is almost, but not quite the same thing as ISO-8859-1 (Latin 1). If so, use cp1252 instead of latin-1 as the encoding parameter.
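To illustrate the difference, here is a minimal sketch that decodes the same bytes with both codecs. The sample bytes are made up for the example; only byte 0x80 is treated differently here, and everything from 0xA0 upwards decodes identically.

```python
from __future__ import print_function

# Hypothetical sample: byte 0x80 followed by some accented text.
raw = b"\x80 100, \xe9t\xe9"

as_latin1 = raw.decode("latin-1")  # 0x80 becomes the control character U+0080
as_cp1252 = raw.decode("cp1252")   # 0x80 becomes the euro sign U+20AC

print(repr(as_latin1))
print(repr(as_cp1252))

# The UTF-8 output differs accordingly.
print(repr(as_latin1.encode("utf-8")))
print(repr(as_cp1252.encode("utf-8")))
```

If the euro sign (or curly quotes, dashes and so on) shows up correctly only with cp1252, that is a good hint the file is not really Latin-1.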

edited Apr 26 '20 at 11:31

answered Jul 22 '14 at 16:49

Martijn Pieters


CP1251 is Cyrillic; it's CP1252 that is similar to ISO 8859-1. – Karol S Jul 22 '14 at 20:06
@MartijnPieters Thank you! But how can I know if my input.txt is written in cp1252 or Latin 1? – user3123767 Jul 23 '14 at 02:15
@user3123767: Latin-1 has control codes in the range 80-9F, while CP-1252 has some more characters there (see the [Wikipedia page on CP-1252](http://en.wikipedia.org/wiki/Windows-1252), look for the table rows `8_` and `9_`). If text decoded as cp1252 works and makes sense, then go for that. Few texts, if any, use the Latin-1 control codes anyway. – Martijn Pieters Jul 23 '14 at 08:06
Please also import shutil. And for the sake of completeness: going the other way round, errors are IMHO far more likely, since UTF-8 covers a superset of the characters in ANSI. So if you convert in the other direction:
with io.open(csv_file_path, encoding='utf8', errors='ignore') as source:
    with io.open(csv_file_path_ansi, mode='w', encoding='cp1252', errors='ignore') as target:
        shutil.copyfileobj(source, target)
– Wolfgang Apr 19 '20 at 09:50
@Wolfgang: This answer specifically addresses the combination of codecs the question was using. It depends entirely on your actual data what codecs would make sense. – Martijn Pieters Apr 26 '20 at 11:31
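Following up on the comment about the 80-9F range, here is a rough sketch of how one might check a file for those bytes. The helper name count_c1_bytes and the sample data are assumptions made for this example, and this is only a heuristic: it tells you whether the choice between latin-1 and cp1252 matters for your data, not which encoding the file was really written in.

```python
from __future__ import print_function

def count_c1_bytes(data):
    """Count the bytes in the range 0x80-0x9F in a byte string."""
    # bytearray yields integers on both Python 2 and Python 3.
    return sum(1 for b in bytearray(data) if 0x80 <= b <= 0x9F)

# Hypothetical sample; for a real file, read it in binary mode instead,
# e.g. io.open(path, 'rb').read()
sample = b"price: \x80 5, \x93quoted\x94"

print(count_c1_bytes(sample))
```

A count of zero means both codecs decode the data to the same text. A non-zero count means the two results differ, and since Latin-1 control codes are rare in real text, cp1252 is then the more likely candidate.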
