4

We've a Django, Postgresql database that contains objects with:

object_date = models.DateTimeField()

as a field.

We need to count the objects by hour per day, so we need to remove some of the extra time data, for example: minutes, seconds and microseconds.

We can remove the extra time data in python:

query = MyModel.objects.values('object_date')
data = [tweet['tweet_date'].replace(minute=0, second=0, microsecond=0) for tweet in query

Which leaves us with a list containing the date and hour.

My Question: Is there a better, faster, cleaner way to do this in the query itself?

a_Fraley
  • 342
  • 1
  • 4
  • 12
  • 1
    This might help, it looks like it's someone doing something similar: http://stackoverflow.com/questions/30270371/hourly-grouping-of-rows-using-django – Peter DeGlopper Jul 08 '16 at 16:32

2 Answers2

3

If you simply want to obtain the dates without the time data, you can use extra to declare calculated fields:

query = MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)'
    })
    .values('object_date_group', 'object_hour_group')

You don't gain too much from just that, though; the database is now sending you even more data.

However, with these additional fields, you can use aggregation to instantly get the counts you were looking for, by adding one line:

query = MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)'
    })
    .values('object_date_group', 'object_hour_group')
    .annotate(count=Count('*'))

Alternatively, you could use any valid SQL to combine what I made two fields into one field, by formatting it into a string, for example. The nice thing about doing that, is that you can then use the tuples to construct a Counter for convenient querying (use values_list()).

This query will certainly be more efficient than doing the counting in Python. For a background job that may not be so important, however.

One downside is that this code is not portable; for one, it does not work on SQLite, which you may still be using for testing purposes. In that case, you might save yourself the trouble and write a raw query right away, which will be just as unportable but more readable.

Update

As of 1.10 it is possible to perform this query nicely using expressions, thanks to the addition of TruncHour. Here's a suggestion for how the solution could look:

from collections import Counter
from django.db.models import Count
from django.db.models.functions import TruncHour

counts_by_group = Counter(dict(
    MyModel.objects
        .annotate(object_group=TruncHour('object_date'))
        .values_list('object_group')
        .annotate(count=Count('object_group'))
)) # query with counts_by_group[datetime.datetime(year, month, day, hour)]

It's elegant, efficient and portable. :)

Thijs van Dien
  • 6,516
  • 1
  • 29
  • 48
  • That looks pretty good, but I need to count the hours for each day, so I need the date and just the hour. Mostly just curious if by somehow doing this in the query, would this be faster than doing it in vanilla python in a Celery Task or the view and template. – a_Fraley Jul 08 '16 at 18:45
  • 1
    @a_Fraley Answer much improved. :) – Thijs van Dien Jul 08 '16 at 21:19
  • The first bit of code worked, since I use a Postgresql database even in testing. I am going to try out this second set. I was reading over the 1.10 docs. Interesting. :) Thanks. – a_Fraley Jul 08 '16 at 22:29
2
count = len(MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)))

or

count = MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)).count()

Assuming I understand what you're asking for, this returns the number of objects that have a date within a specific time range. Set the range to be from the beginning of the hour until the end of the hour and you will return all objects created in that hour. Count() or len() can be used depending on the desired use. For more information on that check out https://docs.djangoproject.com/en/1.9/ref/models/querysets/#count

Written
  • 635
  • 4
  • 12
  • That seems close to what we are looking for. We have 9 days worth of objects, 17,685 objects to count by hour, though, so I was thinking to strip everything from `DateTimeField()` except the date itself and the hour, then count these objects in the database itself using a query `count()`; this task needs to be as fast as possible. – a_Fraley Jul 08 '16 at 17:56
  • 1
    If you want to do it for one specific hour rather than grouping and annotating with `.count()`, an `hour` filter also works - https://docs.djangoproject.com/en/1.9/ref/models/querysets/#hour . Though limiting it to the right date would require another filter clause (so you're counting, say, 9-10 AM on one specific day rather than on all days) so there's not much practical difference between the two options. – Peter DeGlopper Jul 08 '16 at 18:06
  • Sorry, my answer should have said filter not values before which I have edited. Anyway, I don't really understand how stripping the data will aid the speed of this though. The ``count()`` can replace ``len()``, either may be faster depending on what you're using it for. For more optimization tips, there might be a faster way to filter through and cache your queries to make them smaller and smaller as you go on, so I'd recommend doing some research on that as well – Written Jul 08 '16 at 18:08