
I have a model like the following:

from django.db import models

class Foo(models.Model):
    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    owner = models.CharField(max_length=20)
    exists = models.BooleanField()
    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )

It is populated with some data:

fruit  stuff  color   owner  exists
Apple  Table   Blue     abc    True
 Pear   Book    Red     xyz   False
 Pear  Phone  Green     xyz   False
Apple  Phone   Blue     abc    True
 Pear  Table  Green     abc    True

I need to merge/join this with a collection (not a queryset):

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

So rows 0 and 2 should be returned when I search this model with this list of tuples.

Currently my workaround is to read Foo.objects.all() into a DataFrame, do a merge with the list of tuples, and get the IDs to pass to Foo.objects.filter(). I also tried iterating over the list and calling Foo.objects.get() on each tuple, but it is very slow. The list is quite big.
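Roughly, that workaround looks like this (a sketch assuming pandas; the column names just mirror the model fields):

import pandas as pd

# All existing rows, keyed by the three fields plus the primary key
existing = pd.DataFrame(list(Foo.objects.values('id', 'fruit', 'stuff', 'color')))

# The incoming collection of keys
keys = pd.DataFrame(
    [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')],
    columns=['fruit', 'stuff', 'color'],
)

# Inner merge keeps only rows whose key appears in both frames
ids = existing.merge(keys, on=['fruit', 'stuff', 'color'])['id'].tolist()
qs = Foo.objects.filter(id__in=ids)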

When I tried chaining Q's as suggested by the current answers, it threw an OperationalError (too many SQL variables).

My main goal is the following:

As can be seen from the model, these three fields together form my primary key. The table contains around 15k entries. When I get data from another source I need to check whether the data is already in my table and create/update/delete accordingly (the new data may contain up to 15k entries). Is there a clean and efficient way to check whether these records are already in my table?

Note: The list of tuples does not have to be in that shape. I can modify it, turn it into another data structure or transpose it.

ayhan

5 Answers


You have ('fruit', 'stuff', 'color') defined as unique together.

So if your search tuple is ('Apple', 'Table', 'Blue'), its concatenation will also be a unique string:

f = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
c = [''.join(w) for w in f]
# Output: ['AppleTableBlue', 'PearPhoneGreen']

So we can filter the queryset on an annotation, making use of Concat:

from django.db.models import CharField
from django.db.models.functions import Concat

Foo.objects.annotate(u_key=Concat('fruit', 'stuff', 'color', output_field=CharField())).filter(u_key__in=c)
# Output: <QuerySet [<Foo: #0row>, <Foo: #2row>]>

This will work whether the search keys are tuples or lists.

Transpose case

Case 1:

If the input is a list of 2 tuples:

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

after transposing, the input will be:

transpose_input = [('Apple', 'Pear'), ('Table', 'Phone'), ('Blue', 'Green')]

We can easily tell that the input is transposed by comparing the tuple size with the list size, then use zip to transpose it back so that the above solution works as expected:

if each_tuple_size == 2 and input_list_size == 3:
    transpose_again = list(zip(*transpose_input))
    #  use *transpose_again* variable further

Case 2:

If the input is a list of 3 tuples:

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]

After transposing, the input will be:

transpose_input = [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]

So when the input is an n×n matrix it is impossible to tell whether it has been transposed, and the above approach will fail.
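For instance, a quick sketch of the ambiguity (both orientations are lists of three 3-tuples, so no size check can distinguish them):

original = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
transposed = list(zip(*original))
# transposed == [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]
print(len(original), len(original[0]))      # 3 3
print(len(transposed), len(transposed[0]))  # 3 3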

Satendra

This is the correct query:

from django.db.models import Q

q = Foo.objects.filter(
    Q(fruit='Apple', stuff='Table', color='Blue') |
    Q(fruit='Pear', stuff='Phone', color='Green')
)

This query will also work (if you don't like Q):

q = Foo.objects.filter(
    fruit='Apple', stuff='Table', color='Blue'
) | Foo.objects.filter(
    fruit='Pear', stuff='Phone', color='Green'
)
aliva
  • It throws "OperationalError: too many SQL variables". Does chaining Q's together really offer any advantage over calling `.get()`s? – ayhan Nov 24 '17 at 23:11
  • 1
    @ayhan each .get() you call does one query to database so you have n queries, following error behavior on large chain is expected as your query gets grown up, so dividing it in few queries would be performance wise best option – iklinac Nov 24 '17 at 23:21
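One way to apply that batching suggestion is to chunk the list of tuples and run one OR'd query per chunk (a sketch; the batch size of 500 is arbitrary):

from functools import reduce
from django.db.models import Q

keys = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]  # potentially thousands
batch_size = 500  # arbitrary; keeps each query below the SQL variable limit

matched_ids = []
for i in range(0, len(keys), batch_size):
    batch = keys[i:i + batch_size]
    query = reduce(lambda q, k: q | Q(fruit=k[0], stuff=k[1], color=k[2]), batch, Q())
    matched_ids.extend(Foo.objects.filter(query).values_list('id', flat=True))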

If you know these fields constitute your natural key and you have to do heavy querying on them, add this natural key as a proper field and take measures to maintain it:

class FooQuerySet(models.QuerySet):
    def bulk_create(self, objs, batch_size=None):
        objs = list(objs)
        for obj in objs:
            obj.natural_key = Foo.get_natural_key(obj.fruit, obj.stuff, obj.color)
        return super(FooQuerySet, self).bulk_create(objs, batch_size=batch_size)

    # you might override update(...) with proper F and Value expressions, 
    # but I assume the natural key does not change

class FooManager(models.Manager):
    def get_queryset(self):
        return FooQuerySet(self.model, using=self._db)

class Foo(models.Model):
    NK_SEP = '|||'  # sth unlikely to occur in the other fields

    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    natural_key = models.CharField(max_length=40, unique=True, db_index=True)

    @staticmethod
    def get_natural_key(*args):
        return Foo.NK_SEP.join(args) 

    def save(self, *args, **kwargs):
        self.natural_key = Foo.get_natural_key(self.fruit, self.stuff, self.color)
        super(Foo, self).save(*args, **kwargs)

    objects = FooManager()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )

Now you can query:

from itertools import starmap

lst = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
existing_foos = Foo.objects.filter(natural_key__in=list(starmap(Foo.get_natural_key, lst)))

And batch create:

existing_keys = set(existing_foos.values_list('fruit', 'stuff', 'color'))  # evaluate once
Foo.objects.bulk_create(
    [
        Foo(fruit=x[0], stuff=x[1], color=x[2])
        for x in lst
        if x not in existing_keys
    ]
)
user2390182

What you did with the Q objects is AND all of the conditions together in the WHERE clause.

What you want is to OR the Q objects, one per tuple, like this:

Foo.objects.filter(Q(fruit='Apple', stuff='Table', color='Blue') | Q(...))

To do this programmatically you can do something like the following:

from functools import reduce
from django.db.models import Q

tuples = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

query = reduce(lambda q, value: q | Q(fruit=value[0], stuff=value[1], color=value[2]), tuples, Q())

Foo.objects.filter(query)
iklinac
  • I didn't use reduce, I tried it with [this](https://stackoverflow.com/a/852481/2285236) but considering they will generate the same number of variables, this also doesn't seem like a feasible alternative. – ayhan Nov 24 '17 at 23:13
  • @ayhan in the end this solution has the same result as aliva's solution, or the one you linked, just written in a different form – iklinac Nov 24 '17 at 23:19

This question is likely a manifestation of the X/Y problem. Instead of asking about your problem X you are asking about the solution Y you came up with.

Why are you keeping a counter field in the first place? I mean, why not remove the count field and query it with:

from django.db.models import Count, Sum

Foo.objects.order_by('fruit', 'stuff', 'color')\
           .values('fruit', 'stuff', 'color')\
           .annotate(count=Count('*'))

Or keep it but use the sum of count instead:

Foo.objects.order_by('fruit', 'stuff', 'color')\
           .values('fruit', 'stuff', 'color')\
           .annotate(total=Sum('count'))

If you drop the unique_together constraint all you have to do in order to merge the dataset is to insert your new entries in the database:

for fruit, stuff, color in collection:
    Foo.objects.update_or_create(fruit=fruit, stuff=stuff, color=color)

Or assuming collection is a dict of keys and counts:

from django.db.models import F

for fruit, stuff, color in collection:
    Foo.objects.update_or_create(
        fruit=fruit,
        stuff=stuff,
        color=color,
        defaults={'count': F('count') + collection[(fruit, stuff, color)]},
    )

Please don't answer "it is for performance reasons" unless you have profiled both approaches - in my not-so-humble opinion it is the database's job to keep the score. If you try it and really find a performance problem then a competent DBA will propose a solution (in rare cases it may involve keeping an auxiliary table with the count through the use of database triggers).

My point is, keeping a value that can be calculated by the database is a questionable design. You must have a good reason for it, and you must profile the 'let the database calculate it' approach first - otherwise you risk complicating your design because of imaginary performance reasons.

Anyway, I can't think of any strategy that would make this better than O(n), with n being the number of entries in the dataset you want to merge.

Then again, I may have guessed your original problem wrong, so let us know if that is the case.

Paulo Scardine
  • count field is not related to how many times fruit, stuff, color combinations appear in the table, it is just an attribute of that record that I obtain from another source. – ayhan Nov 28 '17 at 20:08
  • Oh, I guess the example would be less confusing without it. Does the `update_or_create` advice still apply? – Paulo Scardine Nov 28 '17 at 20:12
  • Let me change that to something else. The collection is quite big. If I use `update_or_create`, it will create a new query for each iteration of the loop and create a lot of overhead. For example if I use `bulk_create` instead of `create` in a loop, the time drops from minutes to milliseconds. Trying to use a single query also fails (see [this comment](https://stackoverflow.com/questions/12661253/how-to-bulk-update-with-django#comment64943030_28551658)). That's why I focus on identifying the rows to be changed hoping I can develop a better strategy based on the percentage that needs update. – ayhan Nov 28 '17 at 20:22
  • Sorry, I don't think you can get much better than `O(n)` for this problem unless you use a temporary table and some raw SQL. – Paulo Scardine Nov 28 '17 at 20:26
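A rough sketch of what that temporary-table approach could look like (assumptions: the default table name myapp_foo, the column sizes, and %s-style placeholders, which Django's cursor accepts on all backends):

from django.db import connection

keys = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

with connection.cursor() as cursor:
    # Stage the incoming keys in a temporary table
    cursor.execute(
        "CREATE TEMPORARY TABLE tmp_keys "
        "(fruit varchar(10), stuff varchar(10), color varchar(10))"
    )
    cursor.executemany(
        "INSERT INTO tmp_keys (fruit, stuff, color) VALUES (%s, %s, %s)",
        keys,
    )
    # Join against the real table to find the rows that already exist
    cursor.execute(
        "SELECT f.id FROM myapp_foo f "
        "JOIN tmp_keys t ON f.fruit = t.fruit "
        "AND f.stuff = t.stuff AND f.color = t.color"
    )
    existing_ids = [row[0] for row in cursor.fetchall()]

existing = Foo.objects.filter(id__in=existing_ids)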