I have been reading quite a bit about encoding, and I'm still not sure I'm fully wrapping my head around it. I have a file encoded as ANSI with the word "Solluções" in it. I want to convert the file to UTF-8, but whenever I do, the characters get changed.

Code:

import codecs

with codecs.open(filename_in,'r') as input_file, \
   codecs.open(filename_out,'w','utf-8') as output_file:
   output_file.write(input_file.read())

Result: "Solluções"

I imagine this is a stupid problem, but I am at an impasse at the moment. I tried to call encode('utf-8') on the individual data in the file prior to writing it to no avail, so I'm guessing that's not correct either... I appreciate any help, thank you!

bumble_bee_tuna
Msg

2 Answers

This SO answer to a similar question specifies the source encoding when opening the file: codecs.open(sourceFileName, "r", "your-source-encoding"). Without that argument, Python does not guess the original encoding, so the characters may be interpreted incorrectly.
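
A minimal sketch of that approach, assuming the source file really is Windows-1252 (spelled `cp1252` in Python's table of standard encodings). The helper name `convert_to_utf8` and its parameters are made up for illustration; `io.open` is used so the same code runs on Python 2 and 3.

```python
import io


def convert_to_utf8(path_in, path_out, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8.

    Hypothetical helper; the default source encoding is an assumption.
    """
    # Decode the source bytes with an explicit encoding.
    # newline="" leaves the original line endings untouched.
    with io.open(path_in, "r", encoding=source_encoding, newline="") as src:
        text = src.read()

    # Encode the decoded text as UTF-8 on the way out.
    with io.open(path_out, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Reading the whole file at once is fine for small files; for very large ones you would probably want to copy in chunks instead.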

A warning about encodings: most people who say ANSI mean one of the Windows codepages. Your file may really be in CP (codepage) 1252, which is almost, but not quite, the same thing as ISO-8859-1 (Latin-1). If so, use cp1252 instead of latin-1 as your-source-encoding.
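
To illustrate where the two differ: as far as I know, the mismatch is confined to bytes 0x80 through 0x9F, which CP1252 mostly maps to printable characters and Latin-1 maps to control characters. The byte values below are picked only as examples.

```python
# Byte 0x80 falls in the range where the two encodings disagree.
raw = b"\x80"
as_cp1252 = raw.decode("cp1252")   # U+20AC, the euro sign
as_latin1 = raw.decode("latin-1")  # U+0080, a C1 control character

# Accented Latin letters sit in the shared range and decode identically.
shared = b"\xe7\xf5"  # c-cedilla and o-tilde in both encodings
same = shared.decode("cp1252") == shared.decode("latin-1")
```

So for text made up only of letters like these, either name decodes to the same result; the choice matters once characters such as the euro sign or curly quotes show up.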

Community
Josh Durham
  • That did it, it was "cp-1252." Very helpful, thank you! All this encoding business is quite interesting. – Msg Jun 12 '15 at 04:04

You can try:

  from codecs import encode, decode
  # open both files in binary mode so we read and write raw bytes
  with open(filename_in, "rb") as input_file, open(filename_out, "wb") as output_file:
       decoded_unicode = decode(input_file.read(), "cp1252")  # I'm guessing this is what you mean by "ANSI"
       utf8_bytes = encode(decoded_unicode, "utf8")
       output_file.write(utf8_bytes)
Joran Beasley