I have a system written in Python that processes large amounts of data using plug-ins written by several developers with varying levels of experience.

Basically, the application starts several worker threads, then feeds them data. Each thread determines the plug-in to use for an item and asks it to process the item. A plug-in is just a Python module with a specific function defined. The processing usually involves regular expressions, and should not take more than a second or so.

Occasionally, one of the plug-ins will take minutes to complete, pegging the CPU at 100% the whole time. This is usually caused by a sub-optimal regular expression paired with a data item that exposes that inefficiency.

This is where things get tricky. If I have a suspicion of who the culprit is, I can examine its code and find the problem. However, sometimes I'm not so lucky.

  • I can't go single threaded. It would probably take weeks to reproduce the problem if I do.
  • Putting a timer on the plugin doesn't help, because when it freezes it takes the GIL with it, and all the other plugins also take minutes to complete.
  • (In case you were wondering, the SRE engine doesn't release the GIL).
  • As far as I can tell, profiling is pretty useless when multithreading.

Short of rewriting the whole architecture to use multiprocessing, is there any way I can find out who is eating all my CPU?

ADDED: In answer to some of the comments:

  1. Profiling multithreaded code in Python is not useful because the profiler measures the total function time and not the active CPU time. Try cProfile.run('time.sleep(3)') to see what I mean. (credit to rog [last comment]).

  2. Going single threaded is tricky because only 1 item in 20,000 causes the problem, and I don't know which one it is. Running multithreaded allows me to go through 20,000 items in about an hour, while single threaded can take much longer (there's a lot of network latency involved). There are some more complications that I'd rather not get into right now.

That said, it's not a bad idea to try to serialize the specific code that calls the plug-ins, so that the timing of one will not affect the timing of the others. I'll try that and report back.
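For what it's worth, here is a minimal sketch of what serializing the plug-in calls might look like, assuming each plug-in exposes a plain function taking one item. The names `plugin_lock`, `timings`, `call_plugin` and `slowest` are made up for illustration and are not part of the real system. Only one plug-in runs at a time, so the time measured inside the lock is that plug-in's own time, while the worker threads can still overlap their network waits outside it.

```python
import threading
import time

# Hypothetical names: one global lock, so only one plug-in runs at a time.
plugin_lock = threading.Lock()
timings = []  # (plugin_name, item, seconds spent inside the plug-in)


def call_plugin(name, func, item):
    """Call func(item) while holding the lock and record how long it took."""
    with plugin_lock:
        start = time.perf_counter()
        try:
            return func(item)
        finally:
            # Still inside the lock, so the append cannot race with others.
            timings.append((name, item, time.perf_counter() - start))


def slowest(n=5):
    """Return the n slowest recorded calls, slowest first."""
    return sorted(timings, key=lambda t: t[2], reverse=True)[:n]
```

Each worker thread would call `call_plugin(...)` instead of calling the plug-in directly, and `slowest()` can be inspected at the end of the run.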

itsadok
  • What part of the profiling info is being messed up by multithreading? – Hank Gay Jun 23 '09 at 09:43
  • Can you please explain why going single-threaded won't work? If the plug-ins never release the GIL anyway, then you won't have any parallel processing going on at all and doing it multi-threaded won't help. – Michael Kuhn Jun 23 '09 at 09:46
  • "I can't go single threaded. It would probably take weeks to reproduce the problem if I do"; Wrong. Probably going single threaded you'll have the result FASTER than threaded. – nosklo Jun 23 '09 at 11:18

4 Answers

You apparently don't need multithreading, only concurrency, because your threads don't share any state:

Try multiprocessing instead of multithreading

Single thread / N subprocesses. There you can time each request, since no GIL is held across processes.

Another possibility is to get rid of multiple execution threads and use event-based network programming (e.g. Twisted).
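As a rough sketch of the idea (not a drop-in design), the snippet below runs one plug-in call per child process and kills it if it overruns. `run_with_timeout`, `fast_plugin` and `slow_plugin` are hypothetical names invented here. Starting a process per item is expensive, so a real setup would more likely keep a pool of long-lived workers, and on platforms that spawn rather than fork, the plug-in function has to be an importable top-level function.

```python
import multiprocessing as mp
import re


def _call(func, item, conn):
    # Runs in the child process: send the plug-in's result back to the parent.
    conn.send(func(item))
    conn.close()


def run_with_timeout(func, item, timeout):
    """Run func(item) in its own process.

    Returns (True, result) if it finished in time, or (False, None) if it
    was killed after `timeout` seconds or died without sending a result.
    """
    parent_conn, child_conn = mp.Pipe(duplex=False)
    proc = mp.Process(target=_call, args=(func, item, child_conn))
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    finished, result = False, None
    if parent_conn.poll(timeout):
        try:
            result = parent_conn.recv()
            finished = True
        except EOFError:
            pass  # child exited (e.g. the plug-in raised) before sending
    else:
        proc.terminate()  # still busy after `timeout` seconds
    proc.join()
    parent_conn.close()
    return finished, result


def fast_plugin(item):
    return len(re.findall(r"\w+", item))


def slow_plugin(item):
    # Nested quantifiers backtrack heavily on an input that almost matches.
    return bool(re.match(r"(a+)+$", item))


if __name__ == "__main__":
    print(run_with_timeout(fast_plugin, "three short words", timeout=2.0))
    print(run_with_timeout(slow_plugin, "a" * 40 + "b", timeout=2.0))
```

Because the parent is never blocked by the child's GIL, the timeout fires on schedule, and the item and plug-in that triggered it can be logged on the spot.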

makapuf
  • The other advantage of multiprocessing is that you'll be able to 'see' the process, and get the pid. – monkut Jun 24 '09 at 07:19

As you said, because of the GIL it is impossible within the same process.

I recommend starting a second, monitoring process that listens for heartbeats sent from a dedicated thread in your original app. Once the heartbeat has been missing for a specified amount of time, the monitor can kill your app and restart it.
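A minimal sketch of such a monitor, assuming the app can be started as a function in a child process and that a shared timestamp is an acceptable heartbeat channel. `start_heartbeat` and `supervise` are illustrative names, not an existing API. If a plug-in holds the GIL for minutes, the heartbeat thread cannot run, the timestamp goes stale, and the monitor notices.

```python
import multiprocessing as mp
import threading
import time


def start_heartbeat(last_beat, interval=0.5):
    """Call inside the app: a daemon thread that refreshes the timestamp."""
    def beat():
        while True:
            last_beat.value = time.time()
            time.sleep(interval)
    threading.Thread(target=beat, daemon=True).start()


def supervise(target, max_silence, poll=0.1):
    """Run target(last_beat) in a child process and watch its heartbeat.

    Returns True if the child exited on its own, or False if it was
    killed because no heartbeat arrived for `max_silence` seconds.
    """
    last_beat = mp.Value("d", time.time())  # shared double: last beat time
    proc = mp.Process(target=target, args=(last_beat,))
    proc.start()
    while proc.is_alive():
        if time.time() - last_beat.value > max_silence:
            proc.terminate()
            proc.join()
            return False
        time.sleep(poll)
    proc.join()
    return True
```

The app's entry point would call `start_heartbeat(last_beat)` before starting its workers, and the monitor can restart it with something like `while not supervise(app_main, max_silence=10): pass`, where `app_main` stands for whatever function launches the application.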

wr.

Since you have control over the framework, I would suggest disabling all but one plug-in and seeing what happens. Basically, if you have plug-ins P1, P2, ..., Pn, run N processes and disable P1 in the first, P2 in the second, and so on.

It would be much faster than your multithreaded run, since there is no GIL blocking, and you will find out sooner which plug-in is the culprit.

Anurag Uniyal

I'd still look at nosklo's suggestion. You could profile on a single thread to find the item, get the dump for the very long run, and possibly see the culprit. Yeah, I know it's 20,000 items and will take a long time, but sometimes you've just got to suck it up and find the darn thing to convince yourself the problem is caught and taken care of. Run the script, and go work on something else constructive. Come back and analyze the results. That's what separates the men from the boys sometimes ;-)

And/or, add logging that tracks the time taken to execute each item as it is processed by each plug-in. Look at the log data at the end of the run, and see which one took an awfully long time compared to the others.
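One way the logging could look, as a sketch only: `timed_call` and the logger name `plugin_timing` are invented for this example. It records both wall-clock time and the calling thread's own CPU time via `time.thread_time()` (Python 3.7+, not available on every platform). The assumption is that a thread merely waiting for the GIL should accumulate wall time but very little CPU time, so the CPU figure should point at the plug-in that is actually burning cycles.

```python
import logging
import time

log = logging.getLogger("plugin_timing")  # hypothetical logger name


def timed_call(plugin_name, func, item, slow_threshold=1.0):
    """Call func(item) and log how long it took.

    Logs at WARNING when this thread's CPU time exceeds `slow_threshold`
    seconds, and at DEBUG otherwise.
    """
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the current thread only
    try:
        return func(item)
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.thread_time() - cpu_start
        level = logging.WARNING if cpu > slow_threshold else logging.DEBUG
        log.log(level, "plugin=%s item=%.80r wall=%.3fs cpu=%.3fs",
                plugin_name, item, wall, cpu)
```

Filtering the log for WARNING lines at the end of the run then gives the plug-in and item pairs worth a closer look.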

Jay Atkinson