Python UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Question

I'm reading a config file in Python, collecting its sections, and creating a new config file for each section.

However, I'm getting a decode error because one of the strings contains Español=spain:

self.output_file.write( what.replace( " = ", "=", 1 ) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

How would I adjust my code to allow for encoded characters such as these? I'm very new to this, so please excuse me if this is something simple.

import ConfigParser  # Python 2 module name

class EqualsSpaceRemover:
    # File-like wrapper: rewrites the first " = " of each written chunk as "="
    output_file = None
    def __init__( self, new_output_file ):
        self.output_file = new_output_file

    def write( self, what ):
        self.output_file.write( what.replace( " = ", "=", 1 ) )

def get_sections():
    configFilePath = 'C:\\test.ini'
    config = ConfigParser.ConfigParser()
    config.optionxform = str  # keep option names case-sensitive
    config.read(configFilePath)
    for section in config.sections():
        # Copy the section into its own parser and write it to <section>.ini
        configdata = {k:v for k,v in config.items(section)}
        confignew = ConfigParser.ConfigParser()
        cfgfile = open("C:\\" + section + ".ini", 'w')
        confignew.add_section(section)
        for x in configdata.items():
            confignew.set(section,x[0],x[1])
        confignew.write( EqualsSpaceRemover( cfgfile ) )
        cfgfile.close()

check if `what.replace( " = ", "=", 1 ).encode('utf-8')` works — mic4ael, Aug 29 '16 at 13:54
I just tested and it gave me the following: `self.output_file.write( what.replace( " = ", "=", 1 ).encode('utf-8') ) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)` — Ranga Sarin, Aug 29 '16 at 13:59
Sorry if I'm being stupid, but what do you mean? This is the first time I've worked with encoding — Ranga Sarin, Aug 29 '16 at 14:19
what if you open the file with `utf-8`? like `import codecs;codecs.open("C:\\" + section + ".ini", 'w', encoding='utf-8')` — mic4ael, Aug 29 '16 at 14:21
This looks like Python 2 to me. If so, please add the appropriate tag, as unicode handling is completely different between Python 2 and Python 3. Are you using `from __future__ import unicode_literals`? That would explain why you get a UnicodeDecodeError — mata, Aug 29 '16 at 14:29
@mic4ael that still produced the same error. @mata you are correct: removing `from __future__ import unicode_literals` fixed the issue! Thank you so much. — Ranga Sarin, Aug 29 '16 at 14:35

mata · Accepted Answer · 2018-02-08T14:29:45.477

If you use Python 2 with `from __future__ import unicode_literals`, then every string literal you write is a unicode literal, as if you had prefixed it with `u"..."`, unless you explicitly write `b"..."`.

This explains why you get a `UnicodeDecodeError` on this line:

what.replace(" = ", "=", 1)

because what you actually do is

what.replace(u" = ", u"=", 1)

In Python 2, `ConfigParser` uses plain old `str` for its items when it reads a file with the `parser.read()` method, which means `what` will be a `str`. If you pass unicode arguments to `str.replace()`, the string is first converted (decoded) to unicode, the replacement is applied, and the result is returned as unicode. The implicit decode uses the default encoding, which is ASCII in Python 2, so if `what` contains bytes outside the ASCII range, such as the 0xc3 that starts the UTF-8 encoding of ñ, you get a `UnicodeDecodeError` where you wouldn't expect one.
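The implicit step can be reproduced in isolation. The sketch below assumes the config line was stored as UTF-8 bytes (the sample value is made up to match the question) and performs the ASCII decode explicitly, which is what Python 2 does behind the scenes when a `str` meets a unicode argument. The helper name `ascii_decode_error` is hypothetical, and the snippet should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample line: "Español = spain" encoded as UTF-8 bytes.
# The letter ñ becomes the two bytes 0xc3 0xb1, starting at index 4.
what = b"Espa\xc3\xb1ol = spain\n"

def ascii_decode_error(data):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None

err = ascii_decode_error(what)
if err is not None:
    # err.start is the index of the first byte the codec could not decode
    print("ascii decode failed at position %d" % err.start)
```

The reported position matches the "position 4" in the question's traceback, because the four ASCII bytes of "Espa" come first.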

So to make this work you can either:

- use explicit prefixes for byte strings: `what.replace(b" = ", b"=", 1)`
- or remove the `unicode_literals` future import.
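A minimal sketch of the first option, using a made-up sample line as bytes. Because both the data and the arguments are byte strings, no decoding takes place. The snippet should behave the same on Python 2 (with or without `unicode_literals`) and on Python 3.

```python
# Hypothetical sample: the UTF-8 bytes that Python 2's ConfigParser would pass
# to write() for an option "Español" with the value "spain".
what = b"Espa\xc3\xb1ol = spain\n"

# b"..." literals stay byte strings even when unicode_literals is active,
# so replace() never has to decode `what`.
fixed = what.replace(b" = ", b"=", 1)

print(repr(fixed))
```

The non-ASCII bytes pass through untouched, so the output file keeps whatever encoding the input file had.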

Generally you shouldn't mix unicode and str (Python 3 fixes this by making it an error in almost every case). Be aware that `from __future__ import unicode_literals` changes every non-prefixed literal to unicode; it doesn't automatically make your code work with unicode in all cases. In many cases, quite the opposite.
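To illustrate the difference, the sketch below calls `replace()` on bytes with text arguments and reports which exception comes back. The sample data and the helper name `replace_with_text_args` are made up for the example. On Python 3 the mix is rejected with a `TypeError`; on Python 2 it triggers the implicit ASCII decode and the `UnicodeDecodeError` from the question.

```python
import sys

# Hypothetical sample: UTF-8 bytes for "Español = spain".
data = b"Espa\xc3\xb1ol = spain\n"

def replace_with_text_args(raw):
    """Call replace() on bytes with text arguments and report the outcome."""
    try:
        return raw.replace(u" = ", u"=", 1)
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__

outcome = replace_with_text_args(data)
print(sys.version_info[0], outcome)

# Decoding explicitly with the real encoding (assumed to be UTF-8 here)
# keeps everything as text and avoids both errors.
text = data.decode("utf-8").replace(u" = ", u"=", 1)
```

Decoding at the boundary, as in the last line, is the usual way to work with unicode deliberately instead of relying on implicit conversion.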

This doesn't seem like the "right" way of solving the problem. This solution ignores the character encoding of the text and hopes for the best. A better solution is to make sure the config files (both reading and writing) are opened with the correct character encoding (which appears to be utf-8 [according to OP's deleted answer]). By default, py2 uses the OS' default encoding (which appears to be something other than utf-8). — Dunes, Aug 29 '16 at 15:22
@Dunes - In Python 2, `ConfigParser` doesn't assume any encoding for a config file; it's read as a binary file (at least when using the `read(path)` method), and the data is stored as bytes (`str`) internally and written back as bytes. You can use something like `parser.readfp(codecs.open(path, encoding='utf-8'))`, and then unicode will be used for everything, but like many other modules it's intended and documented to be used with `str`. In Python 3 it's a different story; there it only works with unicode. — mata, Aug 29 '16 at 15:46
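For comparison, here is a rough sketch of the Python 3 side of that comment, where `configparser` works with text and the encoding is stated when reading and writing. It assumes Python 3 and a UTF-8 file; the temporary paths and the `[langs]` section are made up for the example, and this is not the code from the question.

```python
import configparser  # Python 3 name of the module
import os
import tempfile

# Write a small UTF-8 config file to a temporary directory (made-up content).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.ini")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("[langs]\nEspa\u00f1ol = spain\n")

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names case-sensitive
parser.read(path, encoding="utf-8")

out_path = os.path.join(tmpdir, "langs.ini")
with open(out_path, "w", encoding="utf-8") as out:
    # space_around_delimiters=False writes "key=value" directly
    parser.write(out, space_around_delimiters=False)

with open(out_path, encoding="utf-8") as fh:
    written = fh.read()
```

With `space_around_delimiters=False`, a wrapper class that strips the spaces around `=` is no longer needed on Python 3.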
