0

I have a model equivalent to this:

class SomeModel(models.Model):
    reference = models.CharField(max_length=24, db_index=True)
    start_date = models.DateTimeField()
    is_active = models.BooleanField()

For a given query date and is_active status, I would like to return a queryset of the period from the query date back to when is_active changes.

e.g. If I had a list:

| reference | start_date | is_active |
|-----------|------------|-----------|
| j39djkj   | 2010-12-01 | True      |
| dkh39d9   | 2010-11-01 | True      |
| j39djkj   | 2010-09-01 | True      |
| dkh39d9   | 2010-08-01 | False     |
| j39djkj   | 2010-05-01 | False     |
| dkh39d9   | 2010-03-01 | True      |
| j39djkj   | 2010-02-01 | True      |
| j39djkj   | 2010-01-01 | True      |

I want to filter by is_active = True when start_date__lte=2010-12-31 and get a queryset back like this:

| reference | start_date | is_active | active_status_days | 
|-----------|------------|-----------|--------------------|
| j39djkj   | 2010-09-01 | True      | 122                |
| dkh39d9   | 2010-11-01 | True      | 61                 |

The number of records is in the tens of millions with about 1 million individual references. Looking for something that is reasonably performant.

Running Django 2.2 on Postgresql with Django Rest Framework.

Thanks and appreciated if you can help...

Turukawa
  • 155
  • 2
  • 11

1 Answers1

0

Answering my own question using a number of different approaches.

Firstly, apply an annotation with a day-count to some specific date. This uses sample code from Oli and their answer here

from django.db.models import Q, Func #  see https://stackoverflow.com/a/38477716/295606

# include the pk for ordering in case there are multiple rows on the same date
q = SomeModel.objects.order_by("reference", "-start_date", "-pk")
# https://stackoverflow.com/a/38477716/295606
target_date = datetime(2010, 12, 31)
q = q.annotate(status_duration = DiffDays(CastDate(target_date ) - CastDate(F("start_date"))) + 1)
f = Q(start_date__lte = target_date)

I haven't included the DiffDays and CastDate functions because they're well-explained by Oli in that answer.

I would prefer to do the next bit as part of the query, but I couldn't figure it out. Instead I import the queryset into pandas and continue from there. I use the Django-Pandas read_frame module to process the query. Pandas shift is the best way to compare successive rows.

from django_pandas.io import read_frame

# Convert the data into a dataframe for further processing
df = read_frame(q.filter(f),
                fieldnames=["reference", "start_date", "is_active", "status_duration"],
                verbose=False)
# Groupby is too slow here, so use np.where and check if this row reference == next row
# If it does, if the status changes on the next row, then keep this row = True
# If the reference on the next row is not the same, then automatically set keep = True
df.loc[:, "keep"] = np.where(df.reference == df.reference.shift(-1),
                             df.is_active != df.is_active.shift(-1), 
                             True)
# Then groupby and get the first row of each group
df = df[df.keep].groupby("reference").head(1).reset_index(drop=True)

The major benefit here is I can process both True and False is_active status simultaneously and get them in a single table.

It's reasonably quick, processing a query of 100,000 rows in about a second on my pc.

Turukawa
  • 155
  • 2
  • 11