1

I am trying to build a model in django 1.9 that has a key, value pair (dictionary) field that also allows query set aggregation (min, mix, etc). I've tried to use the JSONField:

#models.py
from django.contrib.postgres import fields as pgfields
class Entry(models.Model):
    pass

class Scorer(models.Model):
    name = models.CharField(max_length=100)

class EntryScoreSet(models.Model):
    scorer = models.ForeignKey(Scorer)
    entry = models.ForeignKey(Entry, related_name="scorecard")
    scores = pgfields.JSONField(default={})
....
# shell test
import random
entry = Entry()
scorer,_ = Scorer.objects.get_or_create(name="scorer1")
entry.save()
for i in range(0,10):
    scores = dict(scoreA=random.random(),
                  scoreB=random.random(),
                  scoreC=random.random(),
                  )
    entry_score_set=EntryScoreSet(scores=scores, entry=entry, scorer=scorer)
    entry_score_set.save()

entry.scorecard.filter(scorer="scorer1").aggregate(Max("scores__scoreA"))

But I run into the error from this ticket (basically, aggregation is not supported).

A second option is to use a key, value pair model (similar to this answer):

class Score(models.Model):
    entry_score_set = models.ForeignKey(EntryScoreSet, db_index=True,
                                  related_name="scores")
    key = models.CharField(max_length=64, db_index=True)
    value = models.FloatField(db_index=True)

But I don't know how one would get an aggregation across a query set for a particular a key value.

How would I implement a key, value pair field in Django that allows aggregation on a query set for a particular key's value?

EDIT:

Here is a snippet that demonstrates what I want to do using pandas and the second option (key, pair model):

import django_pandas.io as djpdio
scds=Scorecard.objects.filter(
        entry__in=Entry.objects.order_by('?')[:10],
        scorer__name="scorer1")
scorecard_base=djpdio.read_frame(scds,fieldnames=["id","entry__id","scorer__name","scores__id"])
scores=djpdio.read_frame(Score.objects.filter(scorecard__in=scds),fieldnames=["id","key","value"])
scorecard_=(scorecard_base
        .merge(scores,left_on="scores__id",right_on="id")        
        .pivot_table(index="entry__id",columns="key",values="value").reset_index())
scorecard=scorecard_base.merge(scorecard_,on="entry__id")
scorecard["scoreA"].max()

Is something like this possible using django's ORM? How would the efficiency compare to using pandas pivot function?

Community
  • 1
  • 1
jmerkow
  • 1,811
  • 3
  • 20
  • 35
  • 1
    Does the `EntryScoreSet` store anything other that scores? If not, you could just have a `Score` model with a foreign key directly to `Entry`, and then aggregation is simple (happy to post an example once you clarify whether the intermediate model is required). – solarissmoke Jul 25 '16 at 05:11
  • It has an additional foreign key, entries have scores from multiple scorers. I'll update. – jmerkow Jul 25 '16 at 14:38
  • Would it be possible to separate the key, value pairs into two arrays and do this operation? i.e. a ArrayField(CharField), ArrayField(FloatField). Maybe use an F() expression to get the index of 'scoreA' – jmerkow Jul 25 '16 at 16:02

1 Answers1

1

You can do this with conditional expressions, using the second model structure you proposed (Score with a foreign key to EntryScoreSet).

from django.db.models import Case, When, Max, FloatField

entry.scorecard.all().annotate(
    max_score_key1=Max(
        Case(
            When(scores__key='key1', then='scores__value'),
            default=0,
            output_field=FloatField()
        )
    ),
    max_score_key2=Max(
        Case(
            When(scores__key='key2', then='scores__value'),
            default=0,
            output_field=FloatField()
        )
    )
)

This would add a max_score_key1 property to the resulting EntryScoreSet objects, which gives you the maximum value for all Scores that have a key of key1. Similarly max_score_key2 for Scores with key2, etc.


Edit: based on conversation in comments it looks like you want to get the maximum for each key in Score across the whole queryset. You can do that like so:

entry.scorecard.filter(scorer=some_scorer).values('scores__key')\
                      .annotate(Max('scores__value')).order_by()

This will give you output like so:

[
    {'scores__key': 'key1', 'scores__value__max': 16.0}, 
    {'scores__key': 'key2', 'scores__value__max': 15.0},
    ....
]
solarissmoke
  • 30,039
  • 14
  • 71
  • 73
  • YHere, default should be the minimum possible value? I.e. For random.random() that is zero, but for other ranges it should be whatever the lower bound is? Also, I assume entry.scorecard.all can be any queryset? – jmerkow Jul 26 '16 at 04:10
  • Yes, exactly - you can set it to something else if that makes more sense with your data. – solarissmoke Jul 26 '16 at 04:12
  • Ok thanks. I'm going to try this out and then I'll accept this. How heavy of a burden does this put on the DB? The queryset will typically have about 300 entries, with 2-20 scores per entry. – jmerkow Jul 26 '16 at 04:15
  • Can't give a qualitative answer - you'll probably have to test and see. As a general rule though this will be more efficient and faster than performing the same logic in Python. – solarissmoke Jul 26 '16 at 04:17
  • So this actually didn't work, but it helped me figure out how to make it work. To make it work, add .aggregation(max_score_key1=Max("key1"),max_score_key2=Max("key2")) (i changed the names in the annotate, and I could remove the Max from the annotate function, keys are unique in my particular QueryString). The goal was max over the entire query set. If you update, I'll accept your answer. – jmerkow Jul 26 '16 at 21:47
  • Oh, I didn't realise you want to get the max over the whole queryset... In that case things are very simple - and you don't have to specify the keys manually in the query! See my edited answer. – solarissmoke Jul 27 '16 at 04:04
  • The edit is works for me. I didn't know that values worked that way, this is very very useful. I am going to accept the answer, but one last tangential question: How would I get the argmax, i.e. the score__key with the maximum value for a single entry? – jmerkow Jul 29 '16 at 23:54
  • I would either do it with a separate query, or manipulate this data in Python, depending on how large your result set is. – solarissmoke Jul 30 '16 at 02:45
  • Oh duh. Thank you for your help. – jmerkow Jul 30 '16 at 21:35