
I am working on a web backend that frequently grabs real-time market data from the web and puts it in a MySQL database.

Currently I have my main thread push tasks into a Queue object. I then have about 20 threads that read from that queue, and if a task is available, they execute it.

Unfortunately, I am running into performance issues, and after doing a lot of research, I can't make up my mind.

As I see it, I have 3 options: Should I take a distributed task approach with something like Celery? Should I switch to Jython or IronPython to avoid the GIL issues? Or should I simply spawn separate processes instead of threads using multiprocessing? If I go for the latter, how many processes is a good number? What is a good multi-process producer/consumer design?

Thanks!

user1094786

2 Answers


First, profile your code to determine what is bottlenecking your performance.
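
A quick way to get a first look is the standard library's cProfile; run_workers() below is just a stand-in for whatever function currently kicks off your producer and worker threads:

    import cProfile
    import pstats

    def run_workers():
        # stand-in for whatever currently starts the producer and the 20 worker threads
        pass

    # profile one full run and dump the stats to a file
    cProfile.run("run_workers()", "worker_profile.out")

    # print the 20 most expensive calls, sorted by cumulative time
    stats = pstats.Stats("worker_profile.out")
    stats.sort_stats("cumulative").print_stats(20)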

If each of your threads is frequently writing to your MySQL database, the problem may be disk I/O, in which case you should consider using an in-memory database and periodically writing it to disk.
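
One rough sketch of that idea (the MySQLdb driver, the connection parameters and the ticks(symbol, price) table are placeholders, not something from your setup): have the workers collect rows in memory and flush them to MySQL as a single batched insert every few seconds.

    import threading
    import time
    import MySQLdb  # any DB-API 2.0 driver (e.g. mysql-connector) works the same way

    FLUSH_INTERVAL = 5.0          # seconds between flushes; tune to taste
    _buffer = []
    _lock = threading.Lock()      # all worker threads append to the same buffer
    _last_flush = time.time()

    def record_tick(symbol, price):
        """Queue a row in memory instead of hitting MySQL on every tick."""
        global _last_flush
        with _lock:
            _buffer.append((symbol, price))
            if time.time() - _last_flush < FLUSH_INTERVAL:
                return
            rows = list(_buffer)
            del _buffer[:]
            _last_flush = time.time()
        # flush outside the lock so the other workers are not blocked on MySQL
        conn = MySQLdb.connect(host="localhost", user="app",
                               passwd="secret", db="market")   # placeholder credentials
        cur = conn.cursor()
        # one multi-row INSERT instead of thousands of single-row transactions
        cur.executemany("INSERT INTO ticks (symbol, price) VALUES (%s, %s)", rows)
        conn.commit()
        conn.close()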

If you discover that CPU performance is the limiting factor, then consider using the multiprocessing module instead of the threading module. Use a multiprocessing.Queue object to push your tasks. Also make sure that your tasks are big enough to keep each core busy for a while, so that the granularity of communication doesn't kill performance. If you are currently using threading, then switching to multiprocessing would be the easiest way forward for now.
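
A bare-bones producer/consumer layout with multiprocessing might look something like this; fetch_and_store() is a placeholder for the real download-and-insert work:

    import multiprocessing

    NUM_WORKERS = multiprocessing.cpu_count()  # one worker per core is a common starting point

    def fetch_and_store(task):
        """Placeholder for the real work: download the data and write it to MySQL."""
        pass

    def worker(task_queue):
        while True:
            task = task_queue.get()
            if task is None:          # sentinel: no more work
                break
            fetch_and_store(task)

    if __name__ == "__main__":
        task_queue = multiprocessing.Queue()
        workers = [multiprocessing.Process(target=worker, args=(task_queue,))
                   for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()

        for task in range(100):       # stand-in for whatever the main thread currently pushes
            task_queue.put(task)

        for _ in workers:             # one sentinel per worker shuts the pool down cleanly
            task_queue.put(None)
        for w in workers:
            w.join()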

Brendan Wood
  • To be honest, I am not sure how to check how my Python process is utilizing its resources in terms of disk I/O and so forth. I have about 20,000-30,000 inserts overall into the DB every minute. Is that considered a lot? An average row length is about 60-70 bytes... – user1094786 May 24 '12 at 19:01
  • Disclaimer: database stuff is not my strength. That said, assuming 30k inserts of 70B per minute, you're only writing about 2MB per minute, so disk throughput is likely not the problem. However, if each insert is one disk transaction, then disk seek latency could definitely be a bottleneck. Is there any way you can do your database inserts as a batch? You should profile your code to see what operations are taking the longest. Have a look at this answer for help with that: http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script – Brendan Wood May 24 '12 at 20:16

Maybe you should use an event-driven approach, with an event-driven framework like Twisted (Python) or node.js (JavaScript). These frameworks can make use of UNIX domain sockets: your consumer listens on a socket, and your event generator pushes all the info to the consumer, so the consumer doesn't have to keep checking whether there's something in the queue.
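
For what it's worth, a minimal Twisted sketch of that shape (the socket path and the print are placeholders): the consumer just listens on a UNIX domain socket, and whatever bytes the producer writes to it arrive in dataReceived, where you would parse them and do the database insert.

    from twisted.internet import protocol, reactor

    class MarketDataConsumer(protocol.Protocol):
        def dataReceived(self, data):
            # whatever bytes the producer pushes arrive here; parse and insert into MySQL
            print("received %d bytes" % len(data))

    class ConsumerFactory(protocol.Factory):
        def buildProtocol(self, addr):
            return MarketDataConsumer()

    # the producer connects to this UNIX domain socket and pushes updates as they happen
    reactor.listenUNIX("/tmp/market-data.sock", ConsumerFactory())
    reactor.run()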

Paco Valdez
  • Thanks - I've definitely considered Twisted before, but I just don't see how this will work - what exactly will the consumers be receiving and how? – user1094786 May 24 '12 at 19:03