Django: Natural Sort QuerySet

Question

I am looking for a way to naturally sort Django's QuerySets. I found a similar question, but it did not focus on QuerySets. Instead they are doing it directly in Python.

So here is my problem. Let's say I have this model:

class Item(models.Model):
    signature = models.CharField('Signatur', max_length=50)

In the Django Admin Interface, I want to use a filter, which sorts them alphanumeric. Currently, they are sorted this way:

What I'd expect is a list of ["BA 1", "BA 2", ...]. I found admin.SimpleListFilter in the official documentation, which sounds quite suitable. But what I get in the queryset() function is a QuerySet, which can not be sorted in a natural way, because it does not contain the elements, but only the query to the database.

The order_by method on QuerySet gives the same ordering as it can be seen in the image. Is there a way to manipulate the QuerySet to get it naturally sorted?

My code so far:

class AlphanumericSignatureFilter(admin.SimpleListFilter):
    title = 'Signature (alphanumeric)'
    parameter_name = 'signature_alphanumeric'

    def lookups(self, request, model_admin):
        return (
            ('signature', 'Signature (alphanumeric)'),
        )

    def queryset(self, request, queryset: QuerySet):
        return queryset.order_by('signature')

How can I transform the QuerySet to get my desired output? Or is there a different way? The Django Admin Interface is really powerful, which is why I want to use it as long as it is possible. But this feature is really missing.

I am currently using Django 1.11

Any help, comments or hints are appreciated. Thanks for your help.

Alexandr Shurigin · Accepted Answer · 2019-12-09T12:08:55.310

That's not Django's bug actually, that's how databases work internally and for example looks like MySql for example doesn't have natural sort by default (I googled not a lot, so maybe I am wrong there). But we can use some workaround for the case.

I put everything with examples & screenshots at https://gist.github.com/phpdude/8a45e1bd2943fa806aeffee94877680a

But basically for the given models.py file

from django.db import models


class Item(models.Model):
    signature = models.CharField('Signatur', max_length=50)

    def __str__(self):
        return self.signature

I've used admin.py just for example with the correct filter implementation

from django.contrib.admin import ModelAdmin, register, SimpleListFilter
from django.db.models.functions import Length, StrIndex, Substr, NullIf, Coalesce
from django.db.models import Value as V

from .models import Item


class AlphanumericSignatureFilter(SimpleListFilter):
    title = 'Signature (alphanumeric)'
    parameter_name = 'signature_alphanumeric'

    def lookups(self, request, model_admin):
        return (
            ('signature', 'Signature (alphanumeric)'),
        )

    def queryset(self, request, queryset):
        if self.value() == 'signature':
            return queryset.order_by(
                Coalesce(Substr('signature', V(0), NullIf(StrIndex('signature', V(' ')), V(0))), 'signature'),
                Length('signature'),
                'signature'
            )


@register(Item)
class Item(ModelAdmin):
    list_filter = [AlphanumericSignatureFilter]

Screenshots with examples

A few references:

PS: It looks like db function Length(column_name) was added on Django 1.9, so you should be able to use it, but generally any Django version supports custom db ORM function call and you can call length() function of the field.

Extra example with using Python library `natsort`

It will work, but requires to load all the possible signatures before for correct sort since it sorts the rows list using python side, not DB side.

It works. But it could be pretty slow in case of a big table size.

From my point of view it should be used only on db tables sizes less than 50 000 rows (for example, depends on your DB server performance & etc).

from django.contrib.admin import ModelAdmin, register, SimpleListFilter
from django.db.models.functions import StrIndex, Concat
from django.db.models import Value as V
from natsort import natsorted

from .models import Item


class AlphanumericTruePythonSignatureFilter(SimpleListFilter):
    title = 'Signature (alphanumeric true python)'
    parameter_name = 'signature_alphanumeric_python'

    def lookups(self, request, model_admin):
        return (
            ('signature', 'Signature (alphanumeric)'),
        )

    def queryset(self, request, queryset):
        if self.value() == 'signature':
            all_ids = list(queryset.values_list('signature', flat=True))
            # let's use "!:!" as a separator for signature values
            all_ids_sorted = "!:!" + "!:!".join(natsorted(all_ids))

            return queryset.order_by(
                StrIndex(V(all_ids_sorted), Concat(V('!:!'), 'signature')),
            )


@register(Item)
class Item(ModelAdmin):
    list_filter = [AlphanumericTruePythonSignatureFilter]

And one more screenshot example for the case

It's not a 100% correct solution but I think this is the best so far. It fails for example when you have more mixed formats such as `BA 0001`, `BA 0010`, `BA 30` which libraries such as `natsort` does sort correctly. — Oskar Persson, Dec 09 '19 at 10:10
Of course, that's actually a very interesting question. I will try to check deeper with more examples & formattings about the case. Have a few more ideas — Alexandr Shurigin, Dec 09 '19 at 10:12
I'm thinking maybe some regex solution could be made here to split the different parts and somehow sort on the capture groups. Not sure how though in all the different database backends. For example with `(.*)\D(?:0*)(?!$)(\d*)$` you get the prefix and suffix as separate groups: https://regex101.com/r/iasgsz/1 — Oskar Persson, Dec 09 '19 at 10:26
@OskarPersson check it now, it works perfectly I believe and will cover at least 99% of cases I believe! This solution should work with any database backend since it doesn't use any special db functions — Alexandr Shurigin, Dec 09 '19 at 10:31
But now `BA 10001`, `BA 11`, `BA 2` and `BA 512` are out of order, right? — Oskar Persson, Dec 09 '19 at 10:38
:facepalm: :) missed that row, was focused on the different prefixes solution — Alexandr Shurigin, Dec 09 '19 at 10:45
@OskarPersson ok, finally I think we are good with the solution. Check for example screenshots please — Alexandr Shurigin, Dec 09 '19 at 11:13
it works for strings with and without spaces as well (take a look at `BAC` value in the list) — Alexandr Shurigin, Dec 09 '19 at 11:14
@AlexandrShurigin Yeah it is better but it doesn't sort for example `BA 0001`, `BA 0010`, `BA 30` "correctly", see fiddle here: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=3305435ce0de9e198b74066fbb26736e — Oskar Persson, Dec 09 '19 at 11:39
of course, it doesn't since no good natively supported "natural sort function" exists in the DBs :( to sort natively we should use custom solutions on lists or "custom solutions" for every "custom data format", but generally it looks like my solution will work. With leading zeros it will not, but without it will :) — Alexandr Shurigin, Dec 09 '19 at 11:45
@OskarPersson what about the solution which includes python side natsort? ;) Do you like it?! — Alexandr Shurigin, Dec 09 '19 at 12:09
not agree! from my experience, scaling depends on your servers count available! :D no way to solve it in a general way for any DB driver & etc — Alexandr Shurigin, Dec 09 '19 at 12:28
@AlexandrShurigin, I do believe your starting approach can be easily generalized to cope with leading 0s ... I'll post an answer to have better formatting support to express this. I thank you and Oscar Persson for the interesting discussion and for introducing me to this exciting django.db.models.functions module I wasn't aware of. I will never ever use RawSQL again ;) — Mario Orlandi, Dec 09 '19 at 20:59

Mario Orlandi · Answer 2 · 2019-12-07T11:27:25.807

If you don’t mind to target a specific database, you can use RawSQL() to inject a SQL expression for parsing your “signature” field, then annotate the recordset with the result; for example (PostgreSQL):

queryset = (
    Item.objects.annotate(
        right_part=RawSQL("cast(split_part(signature, ' ', 2) as int)", ())
    ).order_by('right_part')
)

(In case you needed to support different database formats, you could additionally detect the active engine and supply a suitable expression accordingly)

The nice thing about RawSQL() is that you make very explicit when and where you’re applying a database-specific feature.

As noted by @schillingt, Func() may also be an options. On the other side, I would avoid extra() as it might be very well deprecated (see: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#extra).

Proof (for PostgreSQL):

class Item(models.Model):
    signature = models.CharField('Signatur', max_length=50)

    def __str__(self):
        return self.signature

-----------------------------------------------------

import django
from django.db.models.expressions import RawSQL
from pprint import pprint
from backend.models import Item


class ModelsItemCase(django.test.TransactionTestCase):

    def test_item_sorting(self):
        signatures = [
            'BA 1',
            'BA 10',
            'BA 100',
            'BA 2',
            'BA 1002',
            'BA 1000',
            'BA 1001',
        ]
        for signature in signatures:
            Item.objects.create(signature=signature)
        pprint(list(Item.objects.all()))
        print('')

        queryset = (
            Item.objects.annotate(
                right_part=RawSQL("cast(split_part(signature, ' ', 2) as int)", ())
            ).order_by('right_part')
        )

        pprint(list(queryset))

        self.assertEqual(queryset[0].signature, 'BA 1')
        self.assertEqual(queryset[1].signature, 'BA 2')
        self.assertEqual(queryset[2].signature, 'BA 10')
        self.assertEqual(queryset[3].signature, 'BA 100')
        self.assertEqual(queryset[4].signature, 'BA 1000')
        self.assertEqual(queryset[5].signature, 'BA 1001')
        self.assertEqual(queryset[6].signature, 'BA 1002')

Result:

test_item_sorting (backend.tests.test_item.ModelsItemCase) ... [<Item: BA 1>,
 <Item: BA 10>,
 <Item: BA 100>,
 <Item: BA 2>,
 <Item: BA 1002>,
 <Item: BA 1000>,
 <Item: BA 1001>]

[<Item: BA 1>,
 <Item: BA 2>,
 <Item: BA 10>,
 <Item: BA 100>,
 <Item: BA 1000>,
 <Item: BA 1001>,
 <Item: BA 1002>]
ok

----------------------------------------------------------------------
Ran 1 test in 0.177s

This seems to work great for this specific example but I think an accepted answer should aim for a more generic solution, both regarding multiple database backends and more value patterns — Oskar Persson, Dec 09 '19 at 08:34
@OskarPersson, Generally speaking I do agree with you, and most times, if not always, I rather write generic and database-agnostic code; this is certainly a big plus given by the ORM. However, this might come with a cost: sometimes, you are forced to denormalize data and/or loose some effective database-specific optimizations. Ultimately, the tradeoff depends on the single application, and you might encounter cases where, by giving up some generality, you gain a practical advantage for your project. — Mario Orlandi, Dec 09 '19 at 09:40
A sort of compromise could be to add a View to the database to expose the underlying data in a more convenient way, then mirror the db-view with a Django Model marked with Meta.managed=False; but this would probably be overkill in the use case assigned by the original question. — Mario Orlandi, Dec 09 '19 at 09:41
@OskarPersson: on second thought, I do agree that RawSQL IS NOT the way to go here ;) I learned A LOT from this talk ... thank you for your comment ;) — Mario Orlandi, Dec 10 '19 at 07:25

score 3 · Answer 3 · answered Dec 09 '19 at 01:11

A simple approach is to add another field that is used only for sorting:

class Item(models.Model):
    signature = models.CharField('Signatur', max_length=50)
    sort_string = models.CharField(max_length=60, blank=True, editable=False)

    class Meta:
        ordering = ['sort_string']

    def save(self, *args, **kwargs):
        parts = self.signature.split()
        parts[2] = "{:06d}".format(int(parts[2]))
        self.sort_string = "".join(parts)
        super().save(*args, **kwargs)

Depending on how often your data is updated vs read, this is probably very efficient. sort_string is calculated once whenever an Item is updated, but then it is available as a simple field whenever it is needed. It is simple to adjust the way sort_string is calculated to meet your exact requirements.

It may also be useful (particularly during development) to add a re-save action to your admin:

def re_save(modeladmin, request, queryset):
    for item in queryset:
        item.save()
re_save.short_description = "Re-save"

class ItemAdmin(admin.ModelAdmin):
    actions = [re_save, ]
    ....

so it is easy to trigger re-calculation.

score 2 · Answer 4 · answered Dec 06 '19 at 17:27

I'm assuming your signature field follows this pattern: AAA 123 letters followed by a space followed by numbers (int).

Item.objects.extra(select={
    's1': 'cast(split_part(signature, \' \', 2) as int)', 
    's2': 'split_part(signature, \' \', 1)'
}).order_by('s2', 's1')

score 1 · Answer 5 · answered Dec 05 '19 at 16:04

1

How do you get your naming BA 1, BA 1000 ...etc, the easiest solution is to store your data as this, BA 0001, BA 0002, then use order by, that will work. else you have to apply a mapper, with python in order to transform your list and reorder it with a python logic.

answered Dec 05 '19 at 16:04

gxmad

1,650
4
23
29

I think you also should consider the cases where you have no control over how the data is stored. And doing the sorting in Python is definitely not always a scalable nor viable solution – Oskar Persson Dec 05 '19 at 16:10
in that case, you have the choise of editing how jango gets data, by direclty injecting sql code. like done in the django doc : https://docs.djangoproject.com/en/2.2/topics/db/sql/#mapping-query-fields-to-model-fields – gxmad Dec 05 '19 at 16:14
Yes, and I think the answer on how to use those tools to achieve the solution to the problem in this question is what both @n2o and myself are searching for – Oskar Persson Dec 05 '19 at 16:17

score 1 · Answer 6 · answered Dec 05 '19 at 16:14

I figured this would be simple solution, but apparently it's not. Kudos to you on the good question. This is the approach I suggest:

Read up on how others have solved this at the Postgres / DB level and determine the best way to handle it for yourself. Do you need a custom type, can you use a simple regex, etc
Depending on the above, implement that solution for Postgres in a Django migration. You may need to create a type which can be done via a custom SQL migration. Or maybe you need to create a function at the database level.
Utilize the new postgres artifact. This part will definitely be complicated. You may need use .extra or a Functo access the function or type.

This should be possible, but it'll definitely involve some DB changes and non-typical django usage.

Mario Orlandi · Answer 7 · 2019-12-10T07:07:26.593

Elaborating further my previous proposal and the interesting solution given by @Alexandr Shurigin, I'm now suggesting another option.

This new solution splits "signature" into two field:

code: a variable length alfanumeric string
weigth: a numeric value, possibly with leading 0s to be ignored

Given:

    [
        'X 1',
        'XY 1',
        'XYZ 1',
        'BA 1',
        'BA 10',
        'BA 100',
        'BA 2',
        'BA 1002',
        'BA 1000',
        'BA 1001',
        'BA 003',
    ]

the expected result is:

    [
        'BA 1',
        'BA 2',
        'BA 003',
        'BA 10',
        'BA 100',
        'BA 1000',
        'BA 1001',
        'BA 1002',
        'X 1',
        'XY 1',
        'XYZ 1',
    ]

All computations are delegated to the database in a generic way, thanks to django.db.models.functions module.

    queryset = (
        Item.objects.annotate(
            split_index=StrIndex('signature', Value(' ')),
        ).annotate(
            left=Substr('signature', Value(1), 'split_index', output_field=CharField()),
            right=Substr('signature', F('split_index'), output_field=CharField()),
        ).annotate(
            code=Trim('left'),
            weight=Cast('right', output_field=IntegerField())
        ).order_by('code', 'weight')
    )

A more compact, but also less readable solution, is this:

    queryset = (
        Item.objects.annotate(
            split_index=StrIndex('signature', Value(' ')),
        ).annotate(
            code=Trim(Substr('signature', Value(1), 'split_index', output_field=CharField())),
            weight=Cast(Substr('signature', F('split_index'), output_field=CharField()), output_field=IntegerField())
        ).order_by('code', 'weight')
    )

What I'm really missing here is a "IndexOf" function to compute "split_index" as the position of either the first space OR digit, thus giving a really Natural Sort behaviour (to accept, for example "BA123" as well as "BA 123")

Proof:

import django
#from django.db.models.expressions import RawSQL
from pprint import pprint
from backend.models import Item
from django.db.models.functions import Length, StrIndex, Substr, Cast, Trim
from django.db.models import Value, F, CharField, IntegerField


class ModelsItemCase(django.test.TransactionTestCase):

    def test_item_sorting(self):

        signatures = [
            'X 1',
            'XY 1',
            'XYZ 1',
            'BA 1',
            'BA 10',
            'BA 100',
            'BA 2',
            'BA 1002',
            'BA 1000',
            'BA 1001',
            'BA 003',
        ]
        for signature in signatures:
            Item.objects.create(signature=signature)
        print(' ')
        pprint(list(Item.objects.all()))
        print('')

        expected_result = [
            'BA 1',
            'BA 2',
            'BA 003',
            'BA 10',
            'BA 100',
            'BA 1000',
            'BA 1001',
            'BA 1002',
            'X 1',
            'XY 1',
            'XYZ 1',
        ]

        queryset = (
            Item.objects.annotate(
                split_index=StrIndex('signature', Value(' ')),
            ).annotate(
                code=Trim(Substr('signature', Value(1), 'split_index', output_field=CharField())),
                weight=Cast(Substr('signature', F('split_index'), output_field=CharField()), output_field=IntegerField())
            ).order_by('code', 'weight')
        )
        pprint(list(queryset))

        print(' ')
        print(str(queryset.query))
        self.assertSequenceEqual(
            [row.signature for row in queryset],
            expected_result
        )

The resulting query, for sqlite3 is:

SELECT 
    "backend_item"."id", 
    "backend_item"."signature", 
    INSTR("backend_item"."signature",  ) AS "split_index", 
    TRIM(SUBSTR("backend_item"."signature", 1, INSTR("backend_item"."signature",  ))) AS "code", 
    CAST(SUBSTR("backend_item"."signature", INSTR("backend_item"."signature",  )) AS integer) AS "weight" 
FROM "backend_item" 
ORDER BY "code" ASC, "weight" ASC

and for PostgreSQL:

SELECT 
    "backend_item"."id", 
    "backend_item"."signature", 
    STRPOS("backend_item"."signature",  ) AS "split_index", 
    TRIM(SUBSTRING("backend_item"."signature", 1, STRPOS("backend_item"."signature",  ))) AS "code", 
    (SUBSTRING("backend_item"."signature", STRPOS("backend_item"."signature",  )))::integer AS "weight" 
FROM "backend_item" 
ORDER BY "code" ASC, "weight" ASC

mmmmhhh must be a refuse .. let me check and correct this ;) should have been "end of string" — Mario Orlandi, Dec 09 '19 at 21:02
good catch ;) now fixed. length=None means "end of string" for Substr — Mario Orlandi, Dec 09 '19 at 21:58
A problem with this is that you are limited to having whitespace as separator. "BA 1", "BA 2", "BA 10", "BA1", "BA2", "BA10" does not sort correctly. — Oskar Persson, Dec 13 '19 at 09:42
Though combining this with the answer by @alexandr-shurigin might work. By just adding ordering by length of the original string to the end of the ordering fields I got the example above to sort correctly — Oskar Persson, Dec 13 '19 at 09:46
@OskarPersson: Right ! I'm aware of the limitation. Not sure whether the string length can really be usefull in the most general case: a "left" alphabetic code of variable length, followed by zero or more spaces, then a numeric code possibly starting with zero or more '0's. What could be useful here would be detecting the position of the either the first space or the first digit, to split the original string accordingly. I couldn't find any help for that in django.db.models.functions. — Mario Orlandi, Dec 16 '19 at 14:03

score 0 · Answer 8 · answered Dec 10 '19 at 15:49

Assuming the format for the signature field is fixed (with a single space and second part is numeric: [^ ]+ \d+), we can split it into two parts - base_name (string) and sig_value (int).

Also you don't need the SimpleListFilter (it has a different purpose - to create filters!). You can simply override the get_queryset method:

from django.contrib import admin
from django.db.models import F, IntegerField, TextField, Value
from django.db.models.functions import Cast, StrIndex, Substr

from .models import Item

@admin.register(Item)
class ItemAdmin(admin.ModelAdmin):
    def get_queryset(self, request):
        qs = super(ItemAdmin, self).get_queryset(request)
        return qs.annotate(
            # 1-indexed position of space
            space=StrIndex("name", Value(" ")),

            # part of text before the space
            base_name=Substr("name", 1, F("space") - 1, output_field=TextField()),

            # cast part of text after the space as int
            sig_value=Cast(Substr("name", F("space")), IntegerField()),
        ).order_by("base_name", "sig_value")

Django: Natural Sort QuerySet

8 Answers8

Extra example with using Python library natsort

Extra example with using Python library `natsort`