I'm working on a Django application that consists of a scraper that scrapes thousands of store items (price, description, seller info) per day, and a Django-template frontend that lets the user access the data and view various statistics.
For example, the user can click on 'Item A' and get a detail view listing various statistics about it (like line graphs of price over time, a price distribution, etc.).
The user can also click on reports for individual 'scrapes' and get details such as the number of items scraped, the average price, etc.
All of these statistics are currently calculated in the view itself.
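In simplified form, the views look roughly like this (the `Item` and `PriceRecord` models and their fields are placeholders, not my exact schema):

```python
# views.py -- simplified sketch of the current approach
from django.db.models import Avg, Count
from django.shortcuts import get_object_or_404, render

from .models import Item, PriceRecord  # hypothetical models


def item_detail(request, item_id):
    item = get_object_or_404(Item, pk=item_id)
    # Every statistic is recomputed on each request, directly in the view.
    prices = PriceRecord.objects.filter(item=item).order_by("scraped_at")
    stats = prices.aggregate(avg_price=Avg("price"), num_points=Count("id"))
    price_series = list(prices.values_list("scraped_at", "price"))
    return render(request, "items/detail.html", {
        "item": item,
        "stats": stats,
        "price_series": price_series,
    })
```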
This all works well locally, on a small development database with roughly 100 items. In production, however, this database will eventually grow to 1,000,000+ rows, which leads me to wonder whether calculating the statistics in the view won't cause massive lag down the line (especially as I plan to extend the statistics with more complicated regression analysis, and perhaps some nearest-neighbour ML classification).
The advantage of the view-based approach is that the graphs are always up to date. I could of course also schedule a cron job to run the calculations every few hours (perhaps even on a different server). This would make accessing the information very fast, but would also mean the information could be a few hours old.
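A minimal sketch of what I have in mind for the cron-job variant, as a Django management command that precomputes per-item summaries into a one-row-per-item table (the `ItemStats` model and all names here are hypothetical):

```python
# management/commands/compute_stats.py -- sketch of the precomputation approach
from django.core.management.base import BaseCommand
from django.db.models import Avg, Max, Min

from myapp.models import Item, ItemStats, PriceRecord  # hypothetical models


class Command(BaseCommand):
    help = "Precompute per-item statistics so the views can read them cheaply."

    def handle(self, *args, **options):
        for item in Item.objects.iterator():
            agg = PriceRecord.objects.filter(item=item).aggregate(
                avg_price=Avg("price"),
                min_price=Min("price"),
                max_price=Max("price"),
            )
            # Upsert one summary row per item; the view then just reads this
            # instead of aggregating over the raw scrape data on every request.
            ItemStats.objects.update_or_create(item=item, defaults=agg)
```

A crontab entry along the lines of `0 */3 * * * python /path/to/project/manage.py compute_stats` would then refresh the summaries every three hours, which is where the staleness trade-off comes from.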
I've never really worked with data at this scale before, and was wondering what the best practices are.