Django, SQLite - Accurate ordering of strings with accented letters

Question

Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors some names with accented characters like ’Čapek’ and ’Örkény’ are the end of list instead of at (or directly after) section ’c’ and ’o’ of the list.

My 1st try:
SQLite can accept collation definitions. I searched for one that was made to order UTF-8 strings correctly for example Localized and Unicode collation in Android (Accented Search in sqlite (android)) but found none.

My 2nd try:
I found an old closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as workaround. I found it quite unsatisfying. Firstly if I sort with a Python method (like below) instead of ordering at model level I cannot use generic views. Secondly ordering with a Python method returns the very same result as the SQLite order_by does: ’Čapek’ and ’Örkény’ are placed after section 'z'.

author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))

How could I get the queryset ordered correctly?

[Create a collation](https://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.create_collation) that does the [correct sorting in Python](http://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python). — CL., May 20 '14 at 12:18
I like the link you gave me. Pyuca seems good. Unfortunately I was not able to use it in sorting queryset. It works for strings but in `author_list = sorted(Author.objects.all(), key=lambda x: (Collator().sort_key, x.lastname, x.firstname))` Python ignores the Collator. I have not found any way sort a queryset by it's attributes based on the Collator :( — bardosd, May 20 '14 at 14:32

score 2 · Accepted Answer · edited May 23 '17 at 11:45

Thanks to the link CL wrote in his comment, I managed to overcome the difficulties that I replied about. I answer my question to share the piece of code that worked because using Pyuca to sort querysets seems to be a rare and undocumented case.

# import section
from pyuca import Collator

# Calling Collator() takes some seconds so you should create it as reusable variable.
c = Collator()

# ...

# main part:
author_list = sorted(Author.objects.all(), key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))

The point is to use sort_key method with the attribute you want to sort by as argument. You can sort by multiple attributes as you see in the example.

Last words: In my language (Hungarian) we use four different accented version of the Latin letter ‘o’: ‘o’, ’ó’, ’ö’, ’ő’. ‘o’ and ‘ó’ are equal in sorting, and ‘ö’ and ‘ő’ are equal too, and ‘ö’/’ő’ are after ‘o’/’ó’. In the default collation table the four letters are equal. Now I try to find a way to define or find a localized collation table.

score 0 · Answer 2 · answered May 20 '14 at 12:34

0

You could create a new field in the table, fill it with the result of unidecode, then sort according to it.

Using a property to provide get/set methods could help in keeping the fields in sync.

answered May 20 '14 at 12:34

loopbackbee

21,962
10
62
97

I considered storing unidecoded version of the firstname and lastname fields after I experienced this issue. I decided to find another way to solve it because this solution generates more overhead (storage and I/O) than a properly working collation process. I am not sure how significant the difference is but I prefer to maintain optimal performance than choosing the easiest way to develop the app. Of course if no better solution will be found I will accept your solution. – bardosd May 20 '14 at 13:56

Django, SQLite - Accurate ordering of strings with accented letters

2 Answers2