Compare string alphabetically django db

Question

I have a following model:

class Page(Model):
      book = ForeignKey(Book, on_delete=CASCADE)
      page = IntegerField()
      start = CharField(max_length=350, db_index=True)
      end = CharField(max_length=350, db_index=True)

How do I query DB in order to get pages that "contain" a given word?

page1 = Page.objects.create(start='beaver', end='brother')
page2 = Page.objects.create(start='boy', end='brother')

Page.objects.filter("breast" between start and end) should return page1 and page2.
Page.objects.filter("beast" between start and end) should return nothing.
Page.objects.filter("block" between start and end) should return the only page1, since block is alphabetically after beaver and before brother.

Search should be case-insensitive.

So, I need to write a query that fetches all rows, where start is alphabetically "smaller" than the given word and end is alphabetically "larger" than the given word.

Is `breast`, `block` stored in db? – shafik Jun 21 '19 at 09:41 — shafik, Jun 21 '19 at 09:41

score 2 · Answer 1 · answered Jul 08 '19 at 14:47

2

One option would be to convert all database values for start and end to upper- or lowercase before saving. Then for searching use the filters __gte and __lte (after converting the search term to upper- or lowercase as well).

It seems to work for me (using Python 3.6, Django 2.2, postresql 10):

# create with lowercase words
Page.objects.create(start='beaver', end='brother')
Page.objects.create(start='boy', end='brother')

# filter using lowercase as well
v = request.GET.get('search_term', '').lower()
qs = Page.objects.filter(start__lte=v, end__gte=v)

Or did I misunderstand your question?

answered Jul 08 '19 at 14:47

Ralf

16,086
4
44
68

@AndrewFount I think the order of strings is defined by postgres COLLATE setting and not by Django; read more about it in this similar question https://stackoverflow.com/q/43986882/9225671 – Ralf Jul 08 '19 at 15:36
@AndrewFount do you have a concrete example that is giving you the wrong result when filtering? – Ralf Jul 08 '19 at 15:37

Loïc · Answer 2 · 2019-07-08T15:34:00.290

1

You could convert end and start to decimals.

In your model, use DecimalField instead of CharField.

And then you could use the ascii table to convert your words.

For instance, "love" would translate to : 108 111 118 101

So in database it should have the value : 0.108111118101 (values should be < 1 so that the length of the word doesn't mess with the filtering).

"amour" would translate to : 97 109 111 117 114

Note that the ascii code of "a" has only 2 digits, and all letters must have the same numbers of digits (here 3), so pad it with a 0 if that's the case: 0.097109111117114

Then it's easy to query to see if a decimal is between those or not, using lower than (lt) and greater than (gt)

Notes :

You can use the getters and setters of your models to translate the word to their ascii value and vice-versa.
Convert words to lowercase otherwise it won't work with the ascii table as 'C' has a different value than 'c' for example
Using the ascii table won't work with what's not in the latin alphabet. For instance, ç, é, à, è, ù, will probably break the search, you should consider building your own table, or replacing those letters by their base letter...

Now let's see if "django" is between "amour" and "love":

love   : 0.108111118101
django : 0.100106097110103111 
amour  : 0.097109111117114

Yes it is :)

edited Jul 08 '19 at 15:34

answered Jul 08 '19 at 14:06

Loïc

11,804
1
31
49

2

Start Python, and type `0.100106097110103111 == 0.100106097110103110`. The answer is `True`, even though I've *typed* two *different* numbers (the last digit is different). To put it back into the original words that these numbers would represent: "django" and "djangn" would be the same. The point here is that the limit on number precision is going to be a problem for someone trying to follow this method. Precision needs to be taken care of both on the database side, and on the Python side because problems can occur on both sides. – Louis Jul 08 '19 at 16:36
This approach would works if you keep the array of chars as string, and then compare both. For this operation (**amour** < **love**) `'097109111117114' < '108111118101'` the result will be `True` – Mauro Baraldi Jul 08 '19 at 16:43
@Louis correct me if I'm wrong, but in this case, it depends only on the database. At no point django will try to verify that the query results are "valid". I agree for the other half though, it's important to define the precision on the database side. – Loïc Jul 08 '19 at 17:10
@MauroBaraldi I'm not sure at all that all database work like that. – Loïc Jul 08 '19 at 17:11
2

@Loïc How you convert the strings from strings to numbers? You do that in Python, no? As you put it in your answer: "You can use the getters and setters of your models to translate the word to their ascii value and vice-versa." These setters and getters are written in Python. I can imagine ways to dodge the precision issue in Python, though your answer leaves enough rope for folks to hang themselves (e.g. work in float and then convert to `Decimal` at the last moment). – Louis Jul 08 '19 at 17:20

score 0 · Answer 3 · answered Jul 12 '19 at 04:04

My answer is only applicable to Postgresql, but here may be one solution:

Django with postgresql has a CICharField model field in django.contrib.postgres.fields. This supports indexing on the case-insensitive string as well. The strings will still be stored with their correct case, but comparison operations will be case insensitive.

from django.contrib.postgres.field import CICharField

class Page(Model):
      book = ForeignKey(Book, on_delete=CASCADE)
      page = IntegerField()
      start = CICharField(max_length=350, db_index=True)
      end = CICharField(max_length=350, db_index=True)

This should just about solve your problem, you'll be able to use gte and lte filters and the comparison will be case-insensitive. It should handle unicode just fine depending on the settings in your database.

t1 = "breast"
t2 = "beast"
t3 = "block"
page1 = Page.objects.create(start='beaver', end='brother')
page2 = Page.objects.create(start='boy', end='brother')
Page.objects.filter(start__lte=t1, end__gte=t1)  # <QuerySet [<Page: Page start=beaver, end=brother>, <Page: Page start=boy, end=brother>]>
Page.objects.filter(start__lte=t2, end__gte=t2)  # <QuerySet []>
Page.objects.filter(start__lte=t3, end__gte=t3)  # <QuerySet [<Page: Page start=beaver, end=brother>]>

t4 = "Ù"  # Between Ø and Ú
t5 = "Ü"  # Not between Ø and Ú
page3 = Page.objects.create(start='Ø', end='Ú')
Page.objects.filter(start__lte=t4, end__gte=t4) # <QuerySet [<Page: Page start=Ø, end=Ú>]>
Page.objects.filter(start__lte=t5, end__gte=t5) # <QuerySet []>

This change will generate a migration that installs the CITextExtension() on the database and alters the columns. You might need to separate that migration file into two migrations, where the first would install the CITextExtension and the second would modify your existing columns.

Compare string alphabetically django db

3 Answers3