
I asked a question on SO yesterday titled Deciding and implementing a trending algorithm in Django. A lot of people suggested simple approaches like averages (exponential, weighted, etc.). I have a model called Book and another called Reader:

class Book(models.Model):
    name = models.CharField(max_length=100)

class Reader(models.Model):
    date = models.DateField()
    book = models.ForeignKey(Book)
    reader_count = models.PositiveIntegerField()

A simple structure. New books are added every day, and the reader count for each book is recorded every day too; that is, each book gets one reader-count record per day, accumulating multiple records over time.

I need to calculate the averages for the books for the current week, the current month and the current year. Apart from the current data, I'd like to preserve the historical data too.

If I tried querying this kind of data straight from the DB, it would take a big hit, wouldn't it? Besides, I'm implementing this system with simple averages to begin with, but later on I'd like the flexibility to change my computation method. I have two options:

  • one, where I can partially update the data in another table which stores the computed data every time a new Reader record is added.

  • two, where I could rebuild the aggregated data every night through a script for the current day/week/month.
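Option one can be sketched as a running average that is updated in place each time a new count arrives; in Django this logic would typically live in a post_save signal handler for Reader, with the two fields stored on a row of the aggregate table. The class and names below are illustrative, not from any existing package:

```python
from dataclasses import dataclass

@dataclass
class RunningAverage:
    """Incrementally maintained average for one (book, period) bucket.

    In a real Django setup, `total` and `samples` would be columns on an
    aggregate table, updated whenever a new Reader record is saved.
    """
    total: int = 0
    samples: int = 0

    def add(self, count):
        # O(1) partial update: no re-scan of historical Reader rows needed.
        self.total += count
        self.samples += 1

    @property
    def average(self):
        return self.total / self.samples if self.samples else 0.0

# Feeding in the week-one counts from the sample data below:
week1 = RunningAverage()
for count in (10, 20, 30, 30, 40, 10, 25):
    week1.add(count)
# week1.average is now 165 / 7
```

The appeal of this approach is that each new Reader record costs a constant amount of work, at the price of keeping the aggregate rows consistent with the raw data yourself.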

Here's some sample data and the results.

Book  Date        Count
----  ----------  -----
AAAA  01.01.2012    10
AAAA  02.01.2012    20
AAAA  03.01.2012    30
AAAA  04.01.2012    30
AAAA  05.01.2012    40
AAAA  06.01.2012    10
AAAA  07.01.2012    25
AAAA  08.01.2012    15
AAAA  09.01.2012    10

The reader count average for Week #1 is: 23.5. The reader count average for Week #2 (which in this case would be the current week) is: 12.5, and for the current month and year it would be 21.1.
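As a sanity check of that arithmetic, assuming weeks are simple seven-day blocks counted from January 1:

```python
from statistics import mean

# Daily counts for book AAAA, 01..09 January, from the sample table above.
counts = [10, 20, 30, 30, 40, 10, 25, 15, 10]

week1 = mean(counts[0:7])   # days 1-7 -> 165/7, approximately 23.57
week2 = mean(counts[7:9])   # days 8-9 -> 25/2 = 12.5
month = mean(counts)        # all days -> 190/9, approximately 21.11
```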

HTH.

In order to give any of these a shot, I'd like to build a system to store the data. I need to store the averages on a daily, weekly, and monthly basis, but I'm very lost as to what kind of table structure I should implement. I'd rather not re-invent the wheel, so if any of you know of packages that would let me accomplish this, that would be great.
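For what it's worth, one possible table structure is a single aggregate table keyed on (book, granularity, period start), which keeps historical periods around for free. The table and column names here are illustrative, shown as DDL against an in-memory SQLite database:

```python
import sqlite3

# Hypothetical aggregate table: one row per (book, granularity, period).
# Old rows are never overwritten, so history is preserved; "granularity"
# distinguishes daily, weekly, monthly and yearly averages.
ddl = """
CREATE TABLE book_average (
    id           INTEGER PRIMARY KEY,
    book_id      INTEGER NOT NULL REFERENCES book(id),
    granularity  TEXT    NOT NULL
                 CHECK (granularity IN ('day', 'week', 'month', 'year')),
    period_start DATE    NOT NULL,   -- first day of the period
    average      REAL    NOT NULL,
    UNIQUE (book_id, granularity, period_start)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(ddl)
```

Swapping the computation method later then only changes how the `average` column is filled in, not the schema.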

Thanks.

Mridang Agarwalla

2 Answers


Postgres is very good at doing these kinds of calculations at the same time as your other traffic, so don't worry too much about load (as long as you run this kind of batch job outside of the request-response cycle).

One thing you may look at doing is splitting this kind of work into small cacheable units: each monthly average is really an average of the past 4 weeks, each yearly average is an average of the last 12 months, and all of this is only done on a per-book basis anyway, so why not do small subsets of the work in-request?

from django.core.cache import cache
from django.db.models import Avg
from datetime import timedelta
from functools import wraps

def cached(key, expire):
    def wrapped(f):
        @wraps(f)
        def func(*args):
            cache_key = key % args
            result = cache.get(cache_key)
            if result is None:
                result = f(*args)
                cache.set(cache_key, result, expire)
            return result
        return func
    return wrapped

@cached("book:%s:avg:week:%s", 3600*24) #cache for a day, rolling results!
def book_read_week_average(book_id, week_start):
    week_end = week_start + timedelta(days=7)
    return Reader.objects.filter(book_id=book_id, date__gte=week_start, date__lt=week_end) \
                         .aggregate(Avg('reader_count'))['reader_count__avg']

@cached("book:%s:avg:month:%s", 3600*24) #cache for a day for rolling results
def book_read_month_average(book_id, month_start):
    month_end = month_start + timedelta(days=31)
    return Reader.objects.filter(book_id=book_id, date__gte=month_start, date__lt=month_end) \
                         .aggregate(Avg('reader_count'))['reader_count__avg']

@cached("author:%s:avg:month:%s", 3600*24)
def author_read_month_average(author_id, month_start):
    return sum(book_read_month_average(book.id, month_start)
               for book in Book.objects.filter(author_id=author_id))

Using function composition and cached functions, you generate only the data you need, and only when you need it. You can also store this info in redis instead of the django cache and take advantage of atomic increments on read counts, allowing real-time reading statistics.
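The composition idea is independent of Django; as a plain-Python illustration (with functools.lru_cache standing in for the cache backend and a dict standing in for the Reader table, both purely for demonstration):

```python
from functools import lru_cache
from datetime import date, timedelta

# Toy data standing in for the Reader table: {(book_id, date): count},
# using the sample counts from the question.
READS = {(1, date(2012, 1, d)): c
         for d, c in enumerate((10, 20, 30, 30, 40, 10, 25, 15, 10), start=1)}

@lru_cache(maxsize=None)          # stands in for cache.get / cache.set
def week_average(book_id, week_start):
    days = [week_start + timedelta(days=i) for i in range(7)]
    counts = [READS[(book_id, d)] for d in days if (book_id, d) in READS]
    return sum(counts) / len(counts) if counts else 0.0

@lru_cache(maxsize=None)
def month_average(book_id, month_start):
    # Composed from cached week averages, as the answer suggests. Note this
    # is an average of averages, which only equals the true mean when all
    # weeks are complete and equal-length; empty weeks are skipped (a toy
    # simplification that would also drop genuinely all-zero weeks).
    weeks = [week_average(book_id, month_start + timedelta(days=7 * i))
             for i in range(4)]
    weeks = [w for w in weeks if w]
    return sum(weeks) / len(weeks) if weeks else 0.0
```

A second call to either function with the same arguments is served entirely from the cache, and the month computation reuses any week averages already computed in-request.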

Thomas

I started django-cube to handle exactly this kind of problem (see OLAP cube on Wikipedia). However, for lack of time I didn't manage to get a proper, efficient version of it, so unfortunately it won't do in your case.

Since many people have been asking me about django-cube, I started the development over again, on a fresh repository on github.

Now, with much more experience of the problem than two years back (when I made my first attempt), I have quite a good idea of what I have to do and what the API should look like, and I'll be developing it slowly whenever I have free time. So stay tuned, and of course any help on that project would be very welcome.

sebpiq