Writing Unicode to HTML File Differs from Plain File

Question

I'm having trouble with character encoding when writing to files from a script. I'm downloading some information from a website through an API. I have no control over the format I receive the information in, but here's a quick sample:

{'id': 12, 'name': "Kathy \xc3\x93 Fakename"}
{'id': 23, 'name': "Se\xc3\xb1or Murphy"}

(the names there are "Kathy Ó Fakename" and "Señor Murphy")

This is mostly fine: when I write these to a generic file with no file extension, I get the correct characters.

However, I have two problems. I'm writing all this information into an HTML table. When I write to a file with an .html extension, the wrong characters end up in the file: I get the names Kathy Ã“ Fakename and SeÃ±or Murphy. These incorrect characters also show up in the actual filename, even though the correct ones I want there are perfectly valid in filenames.

I believe I verified that the only difference is the file extension, though I'm still confused, since I didn't expect Python to implicitly adjust what I wrote. The wrong characters are also definitely in the HTML source, not just in how the page displays.

To demonstrate, this code:

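import os

# `users` is the list of dicts shown in the sample above.
# Same data written twice: once with an .html extension, once with none.
with open(os.path.abspath("Test.html"), 'w') as f:
    for user in users:
        f.write("{}: {}<br>".format(user['id'], user['name']))
with open(os.path.abspath("Test"), 'w') as f:
    for user in users:
        f.write("{}: {}\n".format(user['id'], user['name']))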

Results in

Test
12: Kathy Ó Fakename
23: Señor Murphy

Test.html
12: Kathy Ã“ Fakename<br>
23: SeÃ±or Murphy<br>

What's causing the difference here?

You are writing UTF-8, but if you are opening the file with a tool that expects Latin 1 or Windows Codepage 1252 then yes, you'll see [Mojibake](https://en.wikipedia.org/wiki/Mojibake). — Martijn Pieters, Jul 22 '15 at 11:08

score 4 · Accepted Answer · edited May 23 '17 at 12:14

You are writing UTF-8 data, but whatever tool you are using to read the files is decoding them as Windows CP 1252:

>>> print "Kathy \xc3\x93 Fakename".decode('utf8')
Kathy Ó Fakename
>>> print "Kathy \xc3\x93 Fakename".decode('cp1252')
Kathy Ã“ Fakename
>>> print "Se\xc3\xb1or Murphy".decode('utf8')
Señor Murphy
>>> print "Se\xc3\xb1or Murphy".decode('cp1252')
SeÃ±or Murphy
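
The session above is Python 2. As a rough sketch of the same comparison in Python 3, where bytes and text are separate types, the snippet below decodes one of the sample byte strings with both codecs and then reverses the damage. The variable names (`raw`, `correct`, `mojibake`, `repaired`) are only illustrative, and the repair step assumes every byte involved has a CP 1252 mapping, which holds for this sample but not for all data.

```python
# Raw UTF-8 bytes, as in the sample data from the question.
raw = b"Se\xc3\xb1or Murphy"

# Decoding with the codec the data was written in gives the intended text.
correct = raw.decode("utf-8")

# Decoding the same bytes as CP 1252 turns each UTF-8 byte into its own
# character, which is the Mojibake seen in the browser's source view.
mojibake = raw.decode("cp1252")

# Nothing was lost: re-encoding as CP 1252 restores the original bytes,
# which can then be decoded correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(correct)
print(mojibake)
print(repaired == correct)
```

The bytes on disk are identical in both cases; only the reader's choice of codec changes what is displayed.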

Use the right tools or tell those tools to use UTF-8 instead. When using HTML, you could include a meta tag to tell tools what codec to use:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
     Kathy Ó Fakename<br />
     Señor Murphy<br />
  </body>
</html>
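
One way to produce such a file from the script is sketched below. It is not the asker's code: the helper name `write_users_html`, the `HTML_HEAD`/`HTML_TAIL` constants, and the temporary output path are assumptions made for illustration. It uses `io.open` with an explicit `encoding` so the bytes written match the declared charset, and it assumes the API names arrive as UTF-8 byte strings, as in the question. HTML escaping of the names is left out to keep the sketch short.

```python
import io
import os
import tempfile

# Declares the same charset that the file is actually written in.
HTML_HEAD = u"""<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
"""
HTML_TAIL = u"""  </body>
</html>
"""


def write_users_html(path, users):
    """Write one line per user to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(HTML_HEAD)
        for user in users:
            name = user["name"]
            # Assumption: byte-string names from the API are UTF-8.
            if isinstance(name, bytes):
                name = name.decode("utf-8")
            f.write(u"    {}: {}<br />\n".format(user["id"], name))
        f.write(HTML_TAIL)


# Sample data from the question, kept as raw UTF-8 bytes.
users = [
    {"id": 12, "name": b"Kathy \xc3\x93 Fakename"},
    {"id": 23, "name": b"Se\xc3\xb1or Murphy"},
]

# Hypothetical output location; a temporary directory keeps the demo tidy.
out_path = os.path.join(tempfile.mkdtemp(), "Test.html")
write_users_html(out_path, users)
```

With the meta tag in place, a browser that honours it should decode the file as UTF-8 both when rendering the page and when showing its source.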

You may want to read up on Python and Unicode:

- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Pragmatic Unicode by Ned Batchelder
- The Python Unicode HOWTO

I feel foolish, it hadn't occurred to me I was still looking at the HTML source in the browser instead of Notepad++ (where I was looking at the plain file). So this was exactly my mistake, thank you! — SuperBiasedMan, Jul 22 '15 at 11:12
