4

I have the following word that I fetch via a web service: André

From Python, the value looks like: "Andr\u00c3\u00a9". The input is then decoded using json.loads:

>>> import json
>>> json.loads('{"name":"Andr\\u00c3\\u00a9"}')
>>> {u'name': u'Andr\xc3\xa9'}

When I store the above in a utf8 MySQL database, the data is stored like the following using Django:

SomeObject.objects.create(name=u'Andr\xc3\xa9')

Querying the name column from a mysql shell or displaying it in a web page gives: André

The web page displays in utf8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

My database is configured in utf8:

mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci | 
| collation_database   | utf8_unicode_ci | 
| collation_server     | utf8_unicode_ci | 
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

How can I retrieve the word André from a web service, store it properly in a database with no data loss and display it on a web page in its original form?

Thierry Lam
  • 45,304
  • 42
  • 117
  • 144
  • 2
    So far I see no problem here. What you see from Python is the Unicode representation of your string "André". Each character is stored in two bytes. When you store it in MySQL, the database engine automatically converts it to UTF-8 encoding on-the-fly. In UTF-8, some characters are stored using 1 byte only, while others use multiple bytes, but there is no data loss here, the original string can be obtained any time from the UTF-8-encoded representation. – Tamás May 05 '10 at 18:28
  • While not 100% the same as your technology stack, take a look at http://stackoverflow.com/questions/279170/utf-8-all-the-way-through for a good explanation of handling utf-8 losslessly through all the tiers. – ataylor May 05 '10 at 19:05
  • @Tamas: WRONG; look at `u'Andr\xc3\xa9'`; the e-acute is taking up FOUR bytes; see @Bernd's answer – John Machin May 05 '10 at 23:51

1 Answers1

6

The fault is already in the string you pass to json.loads(). \u00c3 is "A tilde" and \00a9 is the copyright sign. Correct for é would be \u00e9.

Probably the string has been encoded in UTF-8 by the sender and decoded as ISO-8859-1 by the receiver.

For example, if you run the following Python script:

# -*- encoding: utf-8 -*-

import json

data = {'name': u'André'}
print('data: {0}'.format(repr(data)))

code = json.dumps(data)
print('code: {0}'.format(repr(code)))

conv = json.loads(code)
print('conv: {0}'.format(repr(conv)))

name = conv['name']
print(u'Name is {0}'.format(name))

The output should look like:

data: {'name': u'Andr\xe9'}
code: '{"name": "Andr\\u00e9"}'
conv: {u'name': u'Andr\xe9'}
Name is André

Managing unicode in Python 2.x can sometimes become a nuisance. Unfortunately, Django does not yet support Python 3.

Bernd Petersohn
  • 2,234
  • 1
  • 15
  • 7
  • +1 but I don't see how Python 3.x stops people doing `rawbytes.decode('latin1')` instead of `rawbytes.decode('utf8')`. Data is NOT stored and transmitted in Unicode; it is still encoded into bytes, and the receiver still needs to know *which* encoding to decode it with. – John Machin May 05 '10 at 23:59
  • What do you use to check the actual value of \u00c3, \00a9 or \u00e9? – Thierry Lam May 06 '10 at 14:17
  • @John Machin: You are right. However, in most cases, the receiver should be able to derive the encoding. For example, if a web server sends out an html page with a form in a particular encoding, practically all browsers will send back the form data in the same encoding. @Thierry Lam: If you can't find a character map viewer under some accessoiries menu on your system, try google for "unicode codepoints". Example site: http://inamidst.com/stuff/unidata/ – Bernd Petersohn May 06 '10 at 18:51