How to change encoding of characters from file

Question

I have been reading quite a bit about encoding, and I'm still not sure I'm fully wrapping my head around it. I have a file encoded as ANSI with the word "Solluções" in it. I want to convert the file to UTF-8, but whenever I do, it changes the characters.

Code:

  import codecs

  with codecs.open(filename_in, 'r') as input_file, \
       codecs.open(filename_out, 'w', 'utf-8') as output_file:
      output_file.write(input_file.read())

Result: "SolluÃ§Ãµes"

I imagine this is a stupid problem, but I am at an impasse at the moment. I tried to call encode('utf-8') on the individual data in the file prior to writing it to no avail, so I'm guessing that's not correct either... I appreciate any help, thank you!

How are you reading the output file? Are you opening it in a utf-8 editor/viewer? Most editors let you change the encoding with which they open the file. — user3557327, Jun 11 '15 at 16:33
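The garbled result in the question is consistent with what this comment suggests: correctly written UTF-8 bytes being displayed as a Windows codepage. A minimal sketch, assuming Python 3 and assuming the viewer decodes the file as cp1252 (the question does not say which viewer was used):

```python
# The word exactly as it appears in the question.
word = "Solluções"

# The bytes a correct UTF-8 conversion would write to disk.
utf8_bytes = word.encode("utf-8")

# What a viewer shows if it wrongly decodes those bytes as cp1252
# (assumed viewer encoding): "SolluÃ§Ãµes", as in the question.
misread = utf8_bytes.decode("cp1252")

# Decoding the same bytes as UTF-8 recovers the original text.
restored = utf8_bytes.decode("utf-8")
```

If `misread` matches what you see on screen, the output file may already be valid UTF-8 and only the viewer's encoding setting needs changing.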

score 1 · Accepted Answer · edited May 23 '17 at 12:05

This SO answer to a similar question specifies the source encoding when opening the input file, like codecs.open(sourceFileName, "r", "your-source-encoding"). Without it, Python has no way to know the original encoding and may not interpret the characters correctly.
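The pattern described above can be sketched as follows, assuming Python 3 and assuming the source file really is cp1252 (check this against your own file). The helper name `convert_to_utf8` and its parameters are hypothetical, and the built-in `open` with an `encoding` argument is used here in place of `codecs.open`:

```python
def convert_to_utf8(filename_in, filename_out, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8, decoding it with source_encoding.

    Hypothetical helper for illustration; source_encoding is an assumption
    about the input file, not something Python can detect on its own.
    """
    # newline="" leaves line endings untouched on both read and write.
    with open(filename_in, "r", encoding=source_encoding, newline="") as input_file, \
         open(filename_out, "w", encoding="utf-8", newline="") as output_file:
        # read() returns decoded text; write() encodes it as UTF-8.
        output_file.write(input_file.read())
```

For example, `convert_to_utf8("in.txt", "out.txt")` would read the placeholder file in.txt as cp1252 and write a UTF-8 copy to out.txt.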

Warning about the encodings: most people talking about ANSI refer to one of the Windows codepages; you may really have a file in CP (codepage) 1252, which is almost, but not quite, the same thing as ISO-8859-1 (Latin-1). If so, use cp1252 (Python's name for that codec; windows-1252 is an accepted alias) instead of latin-1 as your-source-encoding.
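To illustrate where the two encodings differ, here is a small sketch, assuming Python 3. The sample bytes are made up for the example; the two codecs disagree only on bytes 0x80 to 0x9F:

```python
# Made-up sample: 0x80, 0x93 and 0x94 fall in the range where the codecs differ.
raw = b"\x80 100, \x93quoted\x94"

# cp1252 maps these bytes to the euro sign and curly quotation marks.
as_cp1252 = raw.decode("cp1252")

# latin-1 maps the same bytes to invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# Accented Latin letters sit above 0x9F and decode identically in both.
accented = b"Sollu\xe7\xf5es"
same = accented.decode("cp1252") == accented.decode("latin-1")
```

For the word in the question, `same` is true, so either codec would decode it; the choice starts to matter once the file contains characters such as the euro sign or curly quotes.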

answered Jun 11 '15 at 16:35 by Josh Durham · edited May 23 '17 at 12:05 by Community

That did it, it was "cp-1252." Very helpful, thank you! All this encoding business is quite interesting. – Msg Jun 12 '15 at 04:04

score 1 · Answer 2 · answered Jun 11 '15 at 16:37

You can try:

  from codecs import decode, encode

  # Open both files in binary mode: we read raw bytes and write raw bytes.
  with open(filename_in, "rb") as input_file, open(filename_out, "wb") as output_file:
      # I'm guessing cp1252 is what you mean by "ANSI"
      decoded_unicode = decode(input_file.read(), "cp1252")
      utf8_bytes = encode(decoded_unicode, "utf8")
      output_file.write(utf8_bytes)
