
I've come across a problem with special characters in a Django query (Django 1.9.2).

I've created a model that stores a word in it, and I'm feeding that model with words from a Spanish dictionary, using code as follows:

MyModel.objects.get_or_create(word=myword)

And now I've realized that words containing special characters haven't been added, so, for example, there is only one row of MyModel in the database for both año and ano! And when I query the database I retrieve the same object for these two queries:

MyModel.objects.get(word='año')
MyModel.objects.get(word='ano')

...and no, those words are not the same ;D

I would want to create one object for each, of course.
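The accent-insensitive matching described above can be reproduced in plain Python with Unicode normalization — a rough sketch of what a collation like MySQL's `utf8_general_ci` is effectively doing when it treats `ñ` and `n` as equal (the `strip_accents` helper is just for illustration; the collation uses weight tables, not decomposition):

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFD), then drop combining marks (category Mn) --
    # roughly the effect of an accent-insensitive comparison.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print("año" == "ano")                                 # False: Python compares code points
print(strip_accents("año") == strip_accents("ano"))   # True: accents ignored
```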

jgsogo
  • uhmmmm... Just wrote this here, and I have realized that field `word` in `MyModel` is marked as `db_index=True`... Maybe a field of this type can store those special characters but when building the hash to create the key/index it discards special chars? Feature or bug? – jgsogo Sep 10 '16 at 18:56
  • Did you make sure that your database has utf-8 encoding? By default I think MySQL has latin-8. Also, did you try running an insert query on the database directly, using dbshell? – Swakeert Jain Sep 10 '16 at 19:46
  • I haven't, this issue comes from an 'almost' production environment and I want to gather as much info as I can before starting to break things. I'm almost sure the database is utf-8 (I'll check it), it stores all characters properly; the problem is when there is a "collision" between two words, it only stores the first one that arrives. – jgsogo Sep 10 '16 at 19:51
  • @SwakeertJain - MySQL has `utf8` and `latin1`; either is sufficient for Spanish (and the rest of Western Europe). However, the encoding is different. – Rick James Sep 11 '16 at 03:21

3 Answers


I have had similar problems before, and was able to solve the issue by setting an environment level variable.

In the context of production-level Django, I added this to the top of my wsgi.py file:

import sys
reload(sys)  # Python 2 only: reload() and sys.setdefaultencoding() do not exist in Python 3
sys.setdefaultencoding('utf8')

I believe that I found this trick from this answer.

Adam Hopkins

Short answer: You probably want COLLATION utf8_spanish2_ci.

Long answer:

If you are using CHARACTER SET utf8 (or utf8mb4) on the column/table in question, and if you need ano != año, you need COLLATION utf8_bin, utf8_spanish_ci, or utf8_spanish2_ci. All other utf8 collations treat n = ñ. spanish2 differs from spanish in that ch is treated as a separate "letter" between c and d; similarly for ll. More details.

Note that other 'accents' are ignored in comparisons for most utf8 collations except for utf8_bin. For example, C = Ç (except for _bin and _turkish).
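Changing the collation can be done at the column level, so the rest of the table is unaffected. A hedged sketch, assuming a table named `myapp_mymodel` with a `VARCHAR(100)` column called `word` (your actual table and column definitions will differ):

```sql
-- Check the current collation of each column
SHOW FULL COLUMNS FROM myapp_mymodel;

-- Switch just this column to a binary (accent- and case-sensitive) collation
ALTER TABLE myapp_mymodel
    MODIFY word VARCHAR(100)
    CHARACTER SET utf8 COLLATE utf8_bin;
```

Note that `MODIFY` requires restating the full column definition (NULL/NOT NULL, default, etc.), so copy it from the `SHOW FULL COLUMNS` output rather than from this sketch.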

Rick James
  • I need to create a new row for every different word, so I need `n`!=`ñ`, `á`!=`a`, `u`!=`ü`,... I suppose I need to change collation to `utf8_bin` then. (Just for curiosity: `ch` and `ll` are no longer letters since 1994 [RAE](http://www.rae.es/consultas/exclusion-de-ch-y-ll-del-abecedario)) – jgsogo Sep 11 '16 at 07:37
  • Yeah, sounds like utf8_bin -- Watch out; this means that 'A' != 'a'. Thanks for the info on ch and ll. – Rick James Sep 11 '16 at 15:13
  • You might avoid case problems by _storing_ `LCASE(...)`, then _comparing_ `col = LCASE(input_string)`. This would allow an index to still be useful with `utf8_bin`. – Rick James Sep 11 '16 at 15:22
  • I moved that column to `utf8_bin` and everything worked as expected, now I have created a copy of that column into the same table with `utf8_general_ci` and the same string. Depending on the type of search I need I use one or the other to retrieve the rows. Thanks a lot! – jgsogo Sep 12 '16 at 19:39
  • Two columns is a good trick. Usually they are split one for search, one for display. Your use case is 'similar'. – Rick James Sep 12 '16 at 19:44
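The two-column trick from the comments above might look like this in SQL — hypothetical column names, with a binary-collated column for exact storage and a case/accent-insensitive copy for searching:

```sql
-- Add a search column with a forgiving collation alongside the exact one
ALTER TABLE myapp_mymodel
    ADD COLUMN word_search VARCHAR(100)
    CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Populate it from the existing data
UPDATE myapp_mymodel SET word_search = word;
```

Queries that must distinguish año/ano filter on `word`; user-facing fuzzy searches filter on `word_search`.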

Use PostgreSQL, which currently supports the 'unaccent' extension. This makes searching for 'año' possible when only typing 'ano'.

Best thing is, you can decide per filter whether to use this extension, for example:

Person.objects.filter(first_name__unaccent__icontains=search)

Switch your database to PostgreSQL and add the unaccent extension as follows:

Part of this answer comes from @SaeX in another thread: How can I activate the unaccent extension on an already existing model


A migration file needs to be manually made and applied.

  1. First, create an empty migration:

./manage.py makemigrations myapp --empty

  2. Then open the file and add UnaccentExtension to operations:

from django.contrib.postgres.operations import UnaccentExtension
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        (<snip>)
    ]

    operations = [
        UnaccentExtension()
    ]

  3. Now apply the migration using ./manage.py migrate.

If you get the following error during that last step:

django.db.utils.ProgrammingError: permission denied to create extension "unaccent"
HINT:  Must be superuser to create this extension.

... then temporarily grant superuser rights to your user with postgres# ALTER ROLE <user_name> SUPERUSER; and revert afterwards with its NOSUPERUSER counterpart. pgAdminIII can do this, too.

Now enjoy the unaccent functionality using Django:

>>> Person.objects.filter(first_name__unaccent=u"Helène")
[<Person: Michels Hélène>]

Again, part of this answer belongs to @SaeX.

IMPORTANT

But for me his answer still didn't work, so don't forget to add `'django.contrib.postgres'` to `INSTALLED_APPS` in `settings.py`.
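For reference, the relevant fragment of settings.py might look like this (the app list is abbreviated and `myapp` is a placeholder for your own app):

```python
# settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.postgres',  # enables the unaccent lookup, among others
    # ...
    'myapp',
]
```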