
I am using the django_countries module for the list of countries. The problem is that a couple of countries have special characters in their names, like 'Åland Islands' and 'Saint Barthélemy'.

I am calling this method to get the country name:

country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name

I know that country_label is a lazy translation proxy object from django.utils, but it does not give the right name; instead it gives 'Ã…land Islands'. Any suggestions, please?

rnevius
Maverick

3 Answers


Django stores a unicode string as a sequence of code points and marks it as unicode for further processing. UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string used by Django has to be encoded from code points to UTF-8 bytes at some point before it is output. In the case of Åland Islands, what seems to be happening is that the UTF-8 bytes are being interpreted as if each byte were a code point.

The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the Unicode code point of Å (U+00C5). In UTF-8, \xc5 becomes \xc3\x85, where \xc3 and \x85 are each one 8-bit byte. See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex

You can also use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'.

If you then take each byte and use it as a code point, you get these characters: Ã… See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex

See the code snippet with the HTML notation of these characters.

<div id="test">&#xC3;&#x85;&#xC5;</div>
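The byte-level mix-up described above can be reproduced in a few lines. This is a minimal sketch written for Python 3 (the thread itself uses Python 2), and the variable names are illustrative only. It also assumes the displayed "…" comes from Windows-1252, where byte 0x85 maps to U+2026; in strict ISO-8859-1 the same byte is an invisible control character.

```python
# Assumed value: what django_countries most likely returns for "AX".
name = u'\xc5land Islands'            # "Åland Islands", Å is U+00C5

# Correct step: encode the code points to UTF-8 bytes for output.
utf8_bytes = name.encode('utf-8')     # Å becomes the two bytes C3 85

# Wrong step: read those UTF-8 bytes back with a single-byte codec.
as_latin1 = utf8_bytes.decode('latin-1')   # each byte becomes one code point
as_cp1252 = utf8_bytes.decode('cp1252')    # 0x85 becomes U+2026 ("…")

print(repr(utf8_bytes))
print(repr(as_latin1))
print(as_cp1252)
```

If the last line prints the same garbled name that shows up in the page, the damage is probably a decode with the wrong codec somewhere after country_label is set, rather than bad data in the package.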

So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be to encode to UTF-8, which converts u'\xc5' to '\xc3\x85', and then decode those bytes to unicode as ISO-8859-1, which gives u'\xc3\x85land Islands'. Since that isn't in the code you've provided, I'm guessing it happens somewhere between the moment you set country_label and the moment the output is displayed incorrectly, either automatically because of encoding settings or through an explicit assignment somewhere.

FIRST EDIT:

To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your .py file and <meta charset="UTF-8"> in the <head> of your template. To get a unicode string from a django.utils.functional proxy object, you can call unicode() on it, like this:

country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)

SECOND EDIT:

Another way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding), like this:

from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict') 
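To narrow down where the string goes wrong, it can also help to look at the raw code points right after country_label is set. The helpers below are a hypothetical sketch in Python 3 syntax and are not part of Django or django_countries: code_points lists the code points of a string, and looks_like_utf8_mojibake guesses whether a string is UTF-8 that was decoded with a single-byte codec. Only cp1252 and latin-1 are tried, which is an assumption.

```python
def code_points(text):
    """Return the code points of text, e.g. 'U+00C5 U+006C'."""
    return ' '.join('U+%04X' % ord(ch) for ch in text)


def looks_like_utf8_mojibake(text):
    """Heuristic: True if re-encoding with a single-byte codec and
    decoding as UTF-8 succeeds and yields a different string."""
    for codec in ('cp1252', 'latin-1'):
        try:
            repaired = text.encode(codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # not a UTF-8 round trip through this codec
        if repaired != text:
            return True
    return False


good = u'\xc5land Islands'        # correct: starts with U+00C5
bad = u'\xc3\u2026land Islands'   # garbled: starts with U+00C3 U+2026

print(code_points(good[:2]), looks_like_utf8_mojibake(good))
print(code_points(bad[:3]), looks_like_utf8_mojibake(bad))
```

If the value in the view starts with U+00C5, the string is fine in Python and the corruption happens later, for example when the response is rendered or served.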

But since you have already tried many conversions without success, maybe the problem is more complex. Can you share your versions of django_countries and Python, and your Django app's language settings? You can also look directly in your django_countries package (it should be in your Python directory), find the file data.py and open it to see what it looks like. Maybe the data itself is corrupted.

Julien Grégoire
  • I used `country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8')` in the code but still it rendered as `Ã…land`. I am using render method to get the template. – Maverick Jun 09 '15 at 12:23
  • See edit, I'm supposing country_label goes straight to the context and isn't saved in db before being rendered? – Julien Grégoire Jun 09 '15 at 14:02
  • @mad_programmer What happens if you pass encoding argument to `unicode()`, like this: `unicode(fields.Country(...).name, 'UTF-8')`? – xyres Jun 09 '15 at 18:04
  • @xyres Your solution gives error `TypeError: coercing to Unicode: need string or buffer, __proxy__ found` – Maverick Jun 10 '15 at 10:31
  • @JulienGrégoire no the solution you suggested doesnt work, it still gives the same string. Yes, it goes to context through render method and I use it directly in template. Not getting stored anywhere in db. – Maverick Jun 10 '15 at 10:32

try:

from __future__ import unicode_literals #Place as first import.

AND / OR

country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('latin1').decode('utf8')
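The encode('latin1').decode('utf8') round trip only repairs a string that is already garbled; applied to a correct string it fails. The sketch below (Python 3 syntax, with illustrative variable names) shows both cases. The failure case matches the UnicodeDecodeError on byte 0xc5 reported for this approach, which suggests, though it does not prove, that the name was still correct at that point in the view.

```python
garbled = u'\xc3\x85land Islands'   # UTF-8 bytes misread as Latin-1
correct = u'\xc5land Islands'       # proper "Åland Islands"

# On a garbled string the round trip restores the original text.
repaired = garbled.encode('latin1').decode('utf8')
print(repaired == correct)

# On a correct string the intermediate bytes are not valid UTF-8:
# 0xC5 starts a two-byte sequence, but the next byte is plain "l".
try:
    correct.encode('latin1').decode('utf8')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)
```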
alexisdevarennes
  • both the solutions dont work. The second one gives exception. In the second option I get `UnicodeDecodeError UnicodeDecodeError: 'utf8' codec can't decode byte 0xc5 in position 0: invalid continuation byte ` – Maverick Jun 04 '15 at 10:30

Just this week I encountered a similar encoding error. I believe the problem is that the machine's encoding differs from the one Python uses. Try adding this to your .bashrc or .zshrc:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Then, open up a new terminal and run the Django app again.
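To check whether the locale change took effect, Python can report the encodings it picked up from the environment. This is a small sketch using only the standard library; report_encodings is a hypothetical helper name, and what counts as the expected value depends on your system.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python derived from the environment."""
    return {
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        # stdout may be replaced by an object without an encoding
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for key, value in sorted(report_encodings().items()):
    print('%s: %s' % (key, value))
```

Run it in the same shell that starts the Django app, since a process started elsewhere (for example by a service manager) may not read .bashrc or .zshrc.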

Edwin Lunando