
I have a model like the following:

from django.db import models

class Foo(models.Model):
    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    owner = models.CharField(max_length=20)
    exists = models.BooleanField()
    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )

It is populated with some data:

fruit  stuff  color   owner  exists
Apple  Table   Blue     abc    True
 Pear   Book    Red     xyz   False
 Pear  Phone  Green     xyz   False
Apple  Phone   Blue     abc    True
 Pear  Table  Green     abc    True

I need to merge/join this with a collection (not a queryset):

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

So rows 0 and 2 should be returned when I search this model with this list of tuples.

Currently my workaround is to read Foo.objects.all() into a DataFrame, do a merge with the list of tuples, and get the IDs to pass to Foo.objects.filter(). I also tried iterating over the list and calling Foo.objects.get() on each tuple, but it is very slow. The list is quite big.
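Roughly, that workaround looks like this (a sketch assuming pandas; the column names just mirror the model fields):

import pandas as pd

# All existing rows, keyed by the three fields plus the primary key
existing = pd.DataFrame(list(Foo.objects.values('id', 'fruit', 'stuff', 'color')))

# The incoming collection of keys
keys = pd.DataFrame(
    [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')],
    columns=['fruit', 'stuff', 'color'],
)

# Inner merge keeps only rows whose key appears in both frames
ids = existing.merge(keys, on=['fruit', 'stuff', 'color'])['id'].tolist()
qs = Foo.objects.filter(id__in=ids)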

When I tried chaining Q's as suggested by the current answers, it threw an OperationalError (too many SQL variables).

My main goal is the following:

As can be seen from the model, these three fields together form my primary key. The table contains around 15k entries. When I get data from another source I need to check whether the data is already in my table and create/update/delete accordingly (the new data may contain up to 15k entries). Is there a clean and efficient way to check whether these records are already in my table?

Note: The list of tuples does not have to be in that shape. I can modify it, turn it into another data structure or transpose it.

ayhan

5 Answers


You have ('fruit', 'stuff', 'color') defined as unique together.

So if your search tuple is ('Apple', 'Table', 'Blue'), its concatenation will also be a unique string:

f = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
c = [''.join(w) for w in f]
# Output: ['AppleTableBlue', 'PearPhoneGreen']

So we can filter the queryset on an annotation, making use of Concat:

from django.db.models import CharField
from django.db.models.functions import Concat

Foo.objects.annotate(u_key=Concat('fruit', 'stuff', 'color', output_field=CharField())).filter(u_key__in=c)
# Output: <QuerySet [<Foo: #0row>, <Foo: #2row>]>

This will work whether the search keys are tuples or lists.

Transpose case

Case 1:

If the input is a list of 2 tuples:

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

after transposing, the input will be:

transpose_input = [('Apple', 'Pear'), ('Table', 'Phone'), ('Blue', 'Green')]

We can easily tell that the input is transposed by comparing the tuple size with the list size, then use zip to transpose it back so that the above solution works as expected:

if each_tuple_size == 2 and input_list_size == 3:
    transpose_again = list(zip(*transpose_input))
    #  use *transpose_again* variable further

Case 2:

If the input is a list of 3 tuples:

[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]

After transposing, the input will be:

transpose_input = [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]

So when the input is an n×n matrix it is impossible to tell whether it has been transposed, and the above approach will fail.
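For instance, a quick sketch of the ambiguity (both orientations are lists of three 3-tuples, so no size check can distinguish them):

original = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
transposed = list(zip(*original))
# transposed == [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]
print(len(original), len(original[0]))      # 3 3
print(len(transposed), len(transposed[0]))  # 3 3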

Satendra

This is the correct query:

from django.db.models import Q

q = Foo.objects.filter(
    Q(fruit='Apple', stuff='Table', color='Blue') |
    Q(fruit='Pear', stuff='Phone', color='Green')
)

This query will also work (if you don't like Q):

q = Foo.objects.filter(
    fruit='Apple', stuff='Table', color='Blue'
) | Foo.objects.filter(
    fruit='Pear', stuff='Phone', color='Green'
)
aliva
  • It throws "OperationalError: too many SQL variables". Does chaining Q's together really offer any advantage over calling `.get()`s? – ayhan Nov 24 '17 at 23:11
  • 1
    @ayhan each .get() you call does one query to database so you have n queries, following error behavior on large chain is expected as your query gets grown up, so dividing it in few queries would be performance wise best option – iklinac Nov 24 '17 at 23:21
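One way to apply that batching suggestion is to chunk the list of tuples and run one OR'd query per chunk (a sketch; the batch size of 500 is arbitrary):

from functools import reduce
from django.db.models import Q

keys = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]  # potentially thousands
batch_size = 500  # arbitrary; keeps each query below the SQL variable limit

matched_ids = []
for i in range(0, len(keys), batch_size):
    batch = keys[i:i + batch_size]
    query = reduce(lambda q, k: q | Q(fruit=k[0], stuff=k[1], color=k[2]), batch, Q())
    matched_ids.extend(Foo.objects.filter(query).values_list('id', flat=True))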

If you know these fields constitute your natural key and you have to do heavy querying on them, add this natural key as a proper field and take measures to maintain it:

class FooQuerySet(models.QuerySet):
    def bulk_create(self, objs, batch_size=None):
        objs = list(objs)
        for obj in objs:
            obj.natural_key = Foo.get_natural_key(obj.fruit, obj.stuff, obj.color)
        return super(FooQuerySet, self).bulk_create(objs, batch_size=batch_size)

    # you might override update(...) with proper F and Value expressions, 
    # but I assume the natural key does not change

class FooManager(models.Manager):
    def get_queryset(self):
        return FooQuerySet(self.model, using=self._db)

class Foo(models.Model):
    NK_SEP = '|||'  # sth unlikely to occur in the other fields

    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    natural_key = models.CharField(max_length=40, unique=True, db_index=True)

    @staticmethod
    def get_natural_key(*args):
        return Foo.NK_SEP.join(args) 

    def save(self, *args, **kwargs):
        self.natural_key = Foo.get_natural_key(self.fruit, self.stuff, self.color)
        super(Foo, self).save(*args, **kwargs)

    objects = FooManager()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )

Now you can query:

from itertools import starmap

lst = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
existing_foos = Foo.objects.filter(natural_key__in=list(starmap(Foo.get_natural_key, lst)))

And batch create:

existing_keys = set(existing_foos.values_list('fruit', 'stuff', 'color'))  # evaluate once
Foo.objects.bulk_create(
    [
        Foo(fruit=x[0], stuff=x[1], color=x[2])
        for x in lst
        if x not in existing_keys
    ]
)
user2390182

What you did with the Q objects is AND all of the conditions together in the WHERE clause.

What you want is to OR the Q objects, one per tuple, like this:

Foo.objects.filter(Q(fruit='Apple', stuff='Table', color='Blue') | Q(...))

To do this programmatically you can do something like the following:

from functools import reduce
from django.db.models import Q

tuples = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

query = reduce(lambda q, value: q | Q(fruit=value[0], stuff=value[1], color=value[2]), tuples, Q())

Foo.objects.filter(query)
iklinac
  • I didn't use reduce, I tried it with [this](https://stackoverflow.com/a/852481/2285236) but considering they will generate the same number of variables, this also doesn't seem like a feasible alternative. – ayhan Nov 24 '17 at 23:13
  • @ayhan in the end this solution has the same result as aliva's solution, or the one you linked, just written in a different form – iklinac Nov 24 '17 at 23:19

This question is likely a manifestation of the X/Y problem. Instead of asking about your problem X you are asking about the solution Y you came up with.

Why are you keeping a counter field in the first place? I mean, why not remove the count field and query it with:

from django.db.models import Count, Sum

Foo.objects.order_by('fruit', 'stuff', 'color')\
           .values('fruit', 'stuff', 'color')\
           .annotate(count=Count('*'))

Or keep it but use the sum of count instead:

Foo.objects.order_by('fruit', 'stuff', 'color')\
           .values('fruit', 'stuff', 'color')\
           .annotate(total=Sum('count'))

If you drop the unique_together constraint all you have to do in order to merge the dataset is to insert your new entries in the database:

for fruit, stuff, color in collection:
    Foo.objects.update_or_create(fruit=fruit, stuff=stuff, color=color)

Or assuming collection is a dict of keys and counts:

from django.db.models import F

for fruit, stuff, color in collection:
    Foo.objects.update_or_create(
        fruit=fruit,
        stuff=stuff,
        color=color,
        defaults={'count': F('count') + collection[(fruit, stuff, color)]},
    )

Please don't answer "it is for performance reasons" unless you have profiled both approaches - in my not-so-humble opinion it is the database's job to keep the score. If you try it and really find a performance problem then a competent DBA will propose a solution (in rare cases it may involve keeping an auxiliary table with the count through the use of database triggers).

My point is, keeping a value that can be calculated by the database is a questionable design. You must have a good reason for it, and you must profile the 'let the database calculate it' approach first - otherwise you risk complicating your design because of imaginary performance reasons.

Anyway, I can't think of any strategy that would make this better than O(n), with n being the number of entries in the dataset you want to merge.

Then again, I may have guessed your original problem wrong, so let us know if that is the case.

Paulo Scardine
  • count field is not related to how many times fruit, stuff, color combinations appear in the table, it is just an attribute of that record that I obtain from another source. – ayhan Nov 28 '17 at 20:08
  • Oh, I guess the example would be less confusing without it. Does the `update_or_create` advice still apply? – Paulo Scardine Nov 28 '17 at 20:12
  • Let me change that to something else. The collection is quite big. If I use `update_or_create`, it will create a new query for each iteration of the loop and create a lot of overhead. For example if I use `bulk_create` instead of `create` in a loop, the time drops from minutes to milliseconds. Trying to use a single query also fails (see [this comment](https://stackoverflow.com/questions/12661253/how-to-bulk-update-with-django#comment64943030_28551658)). That's why I focus on identifying the rows to be changed hoping I can develop a better strategy based on the percentage that needs update. – ayhan Nov 28 '17 at 20:22
  • Sorry, I don't think you can get much better than `O(n)` for this problem unless you use a temporary table and some raw SQL. – Paulo Scardine Nov 28 '17 at 20:26
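A rough sketch of what that temporary-table approach could look like (assumptions: the default table name myapp_foo, the column sizes, and %s-style placeholders, which Django's cursor accepts on all backends):

from django.db import connection

keys = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]

with connection.cursor() as cursor:
    # Stage the incoming keys in a temporary table
    cursor.execute(
        "CREATE TEMPORARY TABLE tmp_keys "
        "(fruit varchar(10), stuff varchar(10), color varchar(10))"
    )
    cursor.executemany(
        "INSERT INTO tmp_keys (fruit, stuff, color) VALUES (%s, %s, %s)",
        keys,
    )
    # Join against the real table to find the rows that already exist
    cursor.execute(
        "SELECT f.id FROM myapp_foo f "
        "JOIN tmp_keys t ON f.fruit = t.fruit "
        "AND f.stuff = t.stuff AND f.color = t.color"
    )
    existing_ids = [row[0] for row in cursor.fetchall()]

existing = Foo.objects.filter(id__in=existing_ids)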