2

I've run into a specific problem and thought of a solution. But since the solution is fairly involved, I was wondering if others have encountered something similar and could comment on best practices or propose alternatives.

The problem is as follows: I have a web app written in Django with a screen in which data from multiple tables is collected, grouped and aggregated in time intervals. It's basically a big Excel-like matrix with data aggregated in time intervals on one axis, against the resources for the aggregated data per interval on the other axis. Gathering all the data involves many inner and left joins, and because of the report-like character of the presented data, I use raw SQL to query everything together.
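
Schematically, the queries look something like this (PostgreSQL syntax; the table and column names here are made-up placeholders, not my real schema):

```python
from django.db import connection

def aggregate_intervals(start, end):
    # Hourly buckets per resource; assumes at most one correction per entry.
    sql = """
        SELECT r.id AS resource_id,
               date_trunc('hour', e.occurred_at) AS interval_start,
               sum(e.amount + coalesce(c.delta, 0)) AS total
        FROM resource r
        INNER JOIN entry e ON e.resource_id = r.id
        LEFT JOIN correction c ON c.entry_id = e.id
        WHERE e.occurred_at >= %s AND e.occurred_at < %s
        GROUP BY r.id, interval_start
        ORDER BY r.id, interval_start
    """
    cursor = connection.cursor()
    cursor.execute(sql, [start, end])
    return cursor.fetchall()
```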

The problem is that multiple users can concurrently view and edit data in these intervals. They can also edit data at finer or coarser granularities than other users working with the same data, in overlapping or nested intervals. Currently, when a user edits some data, a Django request is fired, the data is altered, and the affected intervals are aggregated, grouped and presented back. But because of the volatile nature of this data, other users might have changed something in the meantime. Grouping, aggregating and re-rendering the table on every edit is also a very heavy operation (depending on the amount of data and the range of the intervals), and it only gets worse with concurrent users editing.

My proposed solution: it's clear that an HTTP request/response mechanism is not really ideal for this kind of thing. The grouping/aggregation is too heavyweight to do once per request, concurrent edits would ideally be channeled between users, and feedback should be realtime, like Google Docs, instead of full page refreshes.

I was thinking about making a daemon process which reads the flat data of interest from the DBMS on request and caches it in memory. All changes to the data would then occur in memory, with a write-through to the DBMS. The daemon channels access to the data through a lock, so it can decide which users may overwrite others' changes.
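
Roughly the core of what I have in mind (a sketch only: `load_rows`, `persist_change` and the `version` field are placeholders for the real DB layer and conflict rule, not existing code):

```python
import threading

class IntervalCache(object):
    """In-memory cache of flat rows with write-through to the DBMS."""

    def __init__(self, load_rows, persist_change):
        self._lock = threading.Lock()
        self._rows = {}  # key -> row dict, each row carrying a 'version'
        self._load_rows = load_rows
        self._persist_change = persist_change

    def get(self, keys):
        with self._lock:
            missing = [k for k in keys if k not in self._rows]
            if missing:
                self._rows.update(self._load_rows(missing))
            return dict((k, self._rows[k]) for k in keys)

    def apply_change(self, key, new_row, expected_version):
        # Serialize edits; reject stale writes so users cannot silently
        # overwrite each other's changes.
        with self._lock:
            current = self._rows.get(key)
            if current is not None and current['version'] != expected_version:
                return False  # stale edit: caller must refresh first
            new_row['version'] = expected_version + 1
            self._rows[key] = new_row
            self._persist_change(key, new_row)  # write-through to the DBMS
            return True
```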

The flat data would be aggregated and grouped in Python code, and only the slices required by the user would be returned; user/daemon communication would run over websockets. The daemon would provide a publisher/subscriber channel, where users interested in specific slices of data are notified when something changes. The daemon itself could be implemented with a framework like Twisted. But I'm not sure a purely event-driven approach would work here, as we want to channel all incoming requests... Maybe these should be put in a queue and processed in a separate thread? Would it be better to have Twisted run in a thread next to my scheduler, or should the Twisted main loop spin off a thread that works on this queue? My understanding is that threading works best for IO, and CPU-heavy Python code basically blocks the other threads. I have both (websockets/DBMS access and data processing), so would that work?
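
Something like this is what I'm picturing: the Twisted reactor does all websocket IO in its own thread, while a single worker thread drains a queue so that edits are applied strictly one at a time (`apply_and_reaggregate`, `slice_key` and `sendMessage` are placeholders for my own code):

```python
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

from twisted.internet import reactor

work_queue = queue.Queue()
subscribers = {}  # slice key -> set of connected client protocols

def worker():
    # Single consumer thread: edits are serialized here, which "channels"
    # concurrent users without fine-grained locking of the data.
    while True:
        edit = work_queue.get()
        result = apply_and_reaggregate(edit)  # the heavy Python work, off the reactor
        # Hand the result back to the reactor thread for websocket IO.
        reactor.callFromThread(notify, edit.slice_key, result)

def notify(slice_key, result):
    for client in subscribers.get(slice_key, ()):
        client.sendMessage(result)  # e.g. an Autobahn websocket protocol

t = threading.Thread(target=worker)
t.daemon = True
t.start()
```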

Has anyone done something similar before?

Thanks in advance!

Karl

Martijnh

2 Answers

3

The scheme Google implemented for the concurrent-editing features of its now abandoned Wave product is documented at http://www.waveprotocol.org/whitepapers/operational-transform. This aspect of Wave seemed like a success, even though the product itself was short-lived.

As far as the questions you asked about implementing your proposed scheme:

  1. An event-driven system is perfectly capable of implementing this idea. Being event driven is a way to organize your code; it doesn't prevent you from implementing any particular functionality.
  2. Threading doesn't work best for very much, particularly in Python.
    1. It has significant disadvantages for CPU-bound work, since CPython only runs a single Python thread at a time (regardless of available hardware resources). This means a multi-threaded CPU-bound Python program is typically no faster, or even slower, than the single-threaded equivalent.
    2. For IO, this shortcoming is less of a limitation, because IO does not involve running Python code on CPython (the IO APIs are all implemented in C, and release the GIL while they block). This means you can do IO in multiple threads concurrently, so threading is potentially a benefit there. However, doing IO concurrently in a single thread is exactly what Twisted is for. Threading offers no benefit over doing the IO in a single thread, as long as you're doing the IO non-blockingly (or asynchronously).
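
To make point 2 concrete: the usual Twisted pattern for keeping the reactor responsive while slow work runs is `deferToThread`. A minimal sketch, where `aggregate` and `publish_to_subscribers` stand in for your own grouping and notification code:

```python
from twisted.internet.threads import deferToThread

def handle_edit(edit):
    # Run the heavy aggregation in Twisted's thread pool so the reactor
    # thread stays free for websocket and database IO. Because of the GIL
    # this does not make the Python work itself any faster; for real
    # parallelism, move it to a separate process instead (for example
    # via reactor.spawnProcess).
    d = deferToThread(aggregate, edit)
    d.addCallback(publish_to_subscribers)
    return d
```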
Jean-Paul Calderone
  • Thanks for the link to OT! Besides being an interesting read, I'm not sure it's really applicable in my case... The transforms work for something like a text editor, but when the collaborative data is a bit more complex, conflict situations seem to become very specific to the application domain. As for nr 2: what to do in a case where you have heavy processing *and* async IO (handling incoming/outgoing connections)? Split up into separate processes again? – Martijnh Aug 26 '12 at 00:00
2

I tried something similar and you might be interested in the solution. Here is my question:

python Socket.IO client for sending broadcast messages to TornadIO2 server

And this is the answer:

https://stackoverflow.com/a/10950702/675065

He also wrote a blog post about the solution:

http://blog.y3xz.com/blog/2012/06/08/a-modern-python-stack-for-a-real-time-web-application/

The software stack consists of:

  • Socket.IO on the client side
  • TornadIO2 as the Socket.IO server, running on top of Tornado
  • Django for the web application itself

I implemented this myself and it works like a charm.
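
For a feel of what the TornadIO2 side looks like, here is a minimal broadcast server along the lines of the TornadIO2 examples (an untested sketch of the documented connection API, not code from the linked posts):

```python
from tornado import web
from tornadio2 import SocketConnection, TornadioRouter, SocketServer

class BroadcastConnection(SocketConnection):
    clients = set()

    def on_open(self, info):
        self.clients.add(self)

    def on_message(self, message):
        # Relay every incoming message to all connected clients.
        for client in self.clients:
            client.send(message)

    def on_close(self):
        self.clients.remove(self)

router = TornadioRouter(BroadcastConnection)

if __name__ == '__main__':
    SocketServer(web.Application(router.urls, socket_io_port=8001))
```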

Alp
  • Thanks! I'll have a look at these libraries. Tornado seems like a better fit than Twisted for this case, which would cover the communication part of my problem. – Martijnh Aug 26 '12 at 00:06
  • By the way, I additionally use [uwsgi](http://projects.unbit.it/uwsgi/) to serve the Django application and [nginx](http://nginx.org/) as a reverse proxy to serve all static files and redirect requests to uwsgi. – Alp Aug 26 '12 at 10:34
  • The last comment is outdated: uwsgi or gunicorn is no longer needed, since Tornado is capable of serving WSGI applications itself, so there is no need for another tool in your software stack. – Alp Jun 18 '15 at 16:06