I've run into a specific problem and thought of a solution. But since the solution is pretty involved, I was wondering if others have encountered something similar and could comment on best practices or propose alternatives.
The problem is as follows: I have a webapp written in Django with a screen in which data from multiple tables is collected, grouped and aggregated in time intervals. It's basically a big Excel-like matrix: data aggregated in time intervals on one axis, against the resources for the aggregated data per interval on the other axis. Gathering all the data involves many inner and left joins, and because of the report-like character of the presented data, I use raw SQL to query everything together.
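To give an idea of what these queries look like, here is a stripped-down sketch (the table and column names are invented for illustration, and `date_trunc` assumes PostgreSQL; the real query joins several more tables):

```python
from django.db import connection

# Illustrative only: "measurements", "resources" and "adjustments"
# are made-up names standing in for my actual schema.
SQL = """
SELECT r.id AS resource_id,
       date_trunc('hour', m.ts) AS interval_start,
       SUM(m.value) AS total
FROM   measurements m
JOIN   resources r ON r.id = m.resource_id
LEFT JOIN adjustments a ON a.measurement_id = m.id
WHERE  m.ts >= %s AND m.ts < %s
GROUP BY r.id, date_trunc('hour', m.ts)
ORDER BY r.id, interval_start
"""

def aggregate(start, end):
    # One row per (resource, interval) cell of the matrix.
    with connection.cursor() as cur:
        cur.execute(SQL, [start, end])
        return cur.fetchall()
```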
The problem is that multiple users can concurrently view and edit data in these intervals. They can also edit data at finer or coarser granularities than other users working with the same data, in sub- or overlapping intervals. Currently, when a user edits some data, a Django request is fired, the data is altered, and the affected intervals are aggregated and grouped again and presented back. But because of the volatile nature of this data, other users might have changed something in the meantime. Also, grouping/aggregating and re-rendering the table on every edit is a very heavy operation (depending on the amount of data and the range of the intervals). This gets worse with concurrent users editing.
My proposed solution: it's clear an HTTP request/response mechanism is not really ideal for this kind of thing. The grouping/aggregation is too heavyweight to redo per request, the concurrent edits should ideally be channeled amongst users, and feedback should be realtime like Google Docs instead of full page refreshes.
I was thinking about making a daemon process which reads the flat data of interest from the DBMS on request and caches it in memory. All changes to the data would then occur in memory with a write-through to the DBMS. The daemon channels access to the data through a lock, so it can decide which users may overwrite other users' changes.
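Roughly what I have in mind for that cache (a minimal sketch; `load_rows` and `persist_row` stand in for the real DBMS access):

```python
import threading

class WriteThroughCache:
    """In-memory copy of the flat rows, guarded by a single lock.

    All mutation goes through apply_edit(), so the daemon controls
    the order in which concurrent user edits are applied.
    """

    def __init__(self, load_rows, persist_row):
        self._lock = threading.Lock()
        self._rows = {}                  # key -> row dict
        self._load_rows = load_rows      # callable: fetch flat data from the DBMS
        self._persist_row = persist_row  # callable: write one row back

    def load(self, keys):
        with self._lock:
            for key, row in self._load_rows(keys):
                self._rows.setdefault(key, row)

    def apply_edit(self, key, changes):
        with self._lock:
            row = self._rows[key]
            row.update(changes)
            self._persist_row(key, row)  # write-through to the DBMS
            return dict(row)             # snapshot to publish to subscribers
```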
The flat data is aggregated and grouped using Python code, and only the slices required by the user are returned; user/daemon communication would run over WebSockets. The daemon would provide a publisher/subscriber channel, where users interested in specific slices of data are notified when something changes. This daemon could be implemented using a framework like Twisted. But I'm not sure an event-driven approach would work here, as we want to channel all incoming requests. Maybe these should be put in a queue and be processed in a separate thread? Would it be better to have Twisted run in a thread next to my scheduler, or should the Twisted main loop spin off a thread that works on this queue (roughly as in the sketch below)? My understanding is that threading works best for I/O, and CPU-heavy Python code basically blocks other threads. I have both (WebSockets/DBMS access and data processing); would that work?
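To make the queue idea concrete, a rough sketch of what I mean (`work_queue`, `subscribers`, `apply_edit` and the autobahn-style `sendMessage` are my assumptions, not working code from my app):

```python
import queue
import threading
from twisted.internet import reactor

work_queue = queue.Queue()   # all edits funnel through here, in arrival order
subscribers = {}             # slice_key -> set of connected WebSocket protocols

def worker():
    # Single consumer thread: edits are applied strictly one at a time,
    # so no two users ever mutate the cached data concurrently.
    while True:
        slice_key, apply_edit = work_queue.get()
        result = apply_edit()  # heavy grouping/aggregation happens here
        # Twisted objects must only be touched from the reactor thread,
        # so hand the result back via callFromThread.
        reactor.callFromThread(publish, slice_key, result)
        work_queue.task_done()

def publish(slice_key, payload):
    # Runs in the reactor thread; notify everyone watching this slice.
    for proto in subscribers.get(slice_key, ()):
        proto.sendMessage(payload)  # assumes an autobahn-style protocol and
                                    # that payload is already serialized bytes

threading.Thread(target=worker, daemon=True).start()
# reactor.listenTCP(...) / reactor.run() would follow in the real daemon
```

My worry remains the GIL: the pure-Python aggregation in the worker thread would still contend with the reactor thread, so maybe the worker belongs in a separate process (multiprocessing) rather than a thread?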
Has anyone done something similar before?
Thanks in advance!
Karl