
I'm having troubles with character encoding when writing to files with a script. What I'm doing is downloading some information from a website with an API. I have no control over what format I receive the information in, but here's a quick sample:

{'id': 12, 'name': "Kathy \xc3\x93 Fakename"}
{'id': 23, 'name': "Se\xc3\xb1or Murphy"}

(the names there are "Kathy Ó Fakename" and "Señor Murphy")

This is mostly fine: when I write these to a plain file with no extension, the characters come out correctly.

However, I have two problems. I'm writing all this information into an HTML table, and when I write to a file with an .html extension, the wrong characters appear in the file. Instead I end up with the names Kathy Ã“ Fakename and SeÃ±or Murphy. These incorrect characters also show up in the actual filename, even though the correct characters I want there are perfectly valid in filenames.

I believe I verified that the only difference is the filetype, though I am still confused since I didn't expect Python to implicitly adjust what I wrote. Also it definitely is in the source of the HTML, not just how it displays.

To demonstrate, this code:

import os

# users is the list of dicts received from the API (see the sample above)
with open(os.path.abspath("Test.html"), 'w') as f:
    for user in users:
        f.write("{}: {}<br>".format(user['id'], user['name']))
with open(os.path.abspath("Test"), 'w') as f:
    for user in users:
        f.write("{}: {}\n".format(user['id'], user['name']))

Results in

Test
12: Kathy Ó Fakename
23: Señor Murphy

Test.html
12: Kathy Ã“ Fakename<br>
23: SeÃ±or Murphy<br>

What's causing the difference here?

SuperBiasedMan

1 Answer


You are writing UTF-8 data, but whatever tool you are using to read the files is decoding them as Windows code page 1252 (CP1252):

>>> print "Kathy \xc3\x93 Fakename".decode('utf8')
Kathy Ó Fakename
>>> print "Kathy \xc3\x93 Fakename".decode('cp1252')
Kathy Ã“ Fakename
>>> print "Se\xc3\xb1or Murphy".decode('utf8')
Señor Murphy
>>> print "Se\xc3\xb1or Murphy".decode('cp1252')
SeÃ±or Murphy

Use the right tools or tell those tools to use UTF-8 instead. When using HTML, you could include a meta tag to tell tools what codec to use:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
     Kathy Ó Fakename<br />
     Señor Murphy<br />
  </body>
</html>
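
A rough sketch of producing such a page from a script, assuming Python 3 and names that have already been decoded to text. `build_page` and `write_page` are hypothetical helper names, and the sample records simply mirror the ones in the question.

```python
import os
import tempfile

# Sample records mirroring the question, already decoded to text (assumption).
users = [
    {"id": 12, "name": "Kathy \u00d3 Fakename"},
    {"id": 23, "name": "Se\u00f1or Murphy"},
]


def build_page(users):
    """Return an HTML document that declares UTF-8 in a meta tag (hypothetical helper)."""
    rows = "\n".join(
        "    {}: {}<br />".format(user["id"], user["name"]) for user in users
    )
    return (
        "<html>\n"
        "  <head>\n"
        '    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
        "  </head>\n"
        "  <body>\n"
        "{}\n"
        "  </body>\n"
        "</html>\n"
    ).format(rows)


def write_page(path, users):
    """Write the page with an explicit encoding so the bytes match the declared charset."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_page(users))


# Usage: write to a temporary directory and confirm the raw bytes are UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), "Test.html")
write_page(out_path, users)
with open(out_path, "rb") as f:
    assert b"Kathy \xc3\x93 Fakename" in f.read()
```

Passing `encoding="utf-8"` explicitly keeps the file contents in step with the meta tag instead of relying on the platform default.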

You may want to read up on Python and Unicode:

Martijn Pieters
    I feel foolish, it hadn't occurred to me I was still looking at the HTML source in the browser instead of Notepad++ (where I was looking at the plain file). So this was exactly my mistake, thank you! – SuperBiasedMan Jul 22 '15 at 11:12