Weak performance of CGEventPost under GPU load

Question

We've stumbled upon a performance problem with Quartz Events, more specifically CGEventPost: during heavy GPU load CGEventPost can block. We've created a small benchmark application to demonstrate the issue. This application is just a loop that creates, post and releases events.

Below you can see the results of running the application. The first run is on an idle system. The second run is with FurMark (GPU stress test) with the dials cranked up as much as possible.

Inner is how long the inner loop takes, basically just creating, posting and releasing an event with Quartz Events.
Outer is how long our program is waiting to be woken up (a sleep). Should be close to the time we sleep, but if the system is under pressure it could be delayed.
Post is how long the event post takes.

18:58:01.683 EventPerformance[4946:707] Measurements: (outer should be close to 10)
18:58:01.684 EventPerformance[4946:707] inner (ms): 0.04, outer (ms): 11.02, CGEventPost (ms): 0.03
18:58:01.684 EventPerformance[4946:707] inner (ms): 0.04, outer (ms): 11.02, CGEventPost (ms): 0.03
18:58:01.685 EventPerformance[4946:707] inner (ms): 0.07, outer (ms): 10.26, CGEventPost (ms): 0.03
18:58:01.685 EventPerformance[4946:707] inner (ms): 0.06, outer (ms): 10.85, CGEventPost (ms): 0.05
18:58:01.686 EventPerformance[4946:707] inner (ms): 0.07, outer (ms): 10.41, CGEventPost (ms): 0.04
18:58:01.686 EventPerformance[4946:707] inner (ms): 0.04, outer (ms): 10.39, CGEventPost (ms): 0.03
18:58:01.686 EventPerformance[4946:707] inner (ms): 0.05, outer (ms): 11.02, CGEventPost (ms): 0.03
18:58:01.687 EventPerformance[4946:707] inner (ms): 0.03, outer (ms): 10.67, CGEventPost (ms): 0.03
18:58:01.687 EventPerformance[4946:707] inner (ms): 0.08, outer (ms): 10.09, CGEventPost (ms): 0.05
18:58:01.688 EventPerformance[4946:707] Averages: (outer should be close to 10)
18:58:01.688 EventPerformance[4946:707] avg inner (ms): 0.05, avg outer (ms): 10.64, avg post (ms): 0.03

Here we can see that posting the event takes about 0.03 ms on average. Also the thread seems to be woken up around 0.5ms too late. No spikes in CGEventPost.

19:02:02.150 EventPerformance[5241:707] Measurements: (outer should be close to 10)
19:02:02.151 EventPerformance[5241:707] inner (ms): 0.03, outer (ms): 10.23, CGEventPost (ms): 0.02
19:02:02.151 EventPerformance[5241:707] inner (ms): 0.02, outer (ms): 10.54, CGEventPost (ms): 0.02
19:02:02.151 EventPerformance[5241:707] inner (ms): 0.02, outer (ms): 11.01, CGEventPost (ms): 0.01
19:02:02.152 EventPerformance[5241:707] inner (ms): 0.02, outer (ms): 10.74, CGEventPost (ms): 0.01
19:02:02.152 EventPerformance[5241:707] inner (ms): 0.02, outer (ms): 10.20, CGEventPost (ms): 0.01
19:02:02.152 EventPerformance[5241:707] inner (ms): 10.35, outer (ms): 11.01, CGEventPost (ms): 10.35
19:02:02.152 EventPerformance[5241:707] inner (ms): 0.03, outer (ms): 10.02, CGEventPost (ms): 0.02
19:02:02.153 EventPerformance[5241:707] inner (ms): 58.90, outer (ms): 10.11, CGEventPost (ms): 58.90
19:02:02.153 EventPerformance[5241:707] inner (ms): 0.03, outer (ms): 10.12, CGEventPost (ms): 0.02
19:02:02.153 EventPerformance[5241:707] Averages: (outer should be close to 10)
19:02:02.371 EventPerformance[5241:707] avg inner (ms): 7.71, avg outer (ms): 10.44, avg post (ms): 7.71

When the system is under heavy GPU load, posting an event can take (spikes) milliseconds instead of microseconds. Under extreme GPU stress (< 1 FPS), this value can take seconds. CGEventPost sometimes seems to be waiting for the GPU to finish some work before returning. Our thread is still scheduled normally with no noticeable delay/spikes (outer).

Any ideas are appreciated.

I see even worse effects in windows (I know you're on osx) when submitting massive amounts of OpenCL work: all GPU drawing slows to a crawl. It's as though the GPUs have no concept of priority or how old a wavefront is. — doug65536, Jan 25 '13 at 01:47
Try profiling it with Instruments to see where the slowdown is occurring. — monoxygen, Jan 30 '13 at 01:56
What type of graphics card is in the computer you ran this on? Integrated, integrated+discrete, or discrete? It would be interesting to see if you get different behavior running on a system with a different graphics card configuration. — monoxygen, Feb 15 '13 at 03:43
Seems that both programs are competing for a locked resource. Use gdb with your code to step through CGEventPost, check what system calls are called and if it enters some mutex. Use dtruss (gdb will be harder to use there) with FurMark and check if the same systems calls are called. That could give you a hint of what's happening. — barracel, Feb 21 '13 at 08:15
OSX's timeslice is 10 ms; the only anomaly is the 60 ms spike. — tc., Mar 07 '13 at 06:30

nielsbot · Accepted Answer · 2013-03-19T23:36:11.680

I guess you're filling up the queue (underlying mach port)...

You can confirm this using the "scheduling" or "system call" instrument in Instruments. (Create a new blank document, add the instrument, then under File > Record Options... make sure "deferred mode" is checked.) This will show all thread activity in your app (when threads block, when they sleep, when they are activated, and why).

I would first try raising the thread priority (cf. man 3 PTHREAD_SCHEDPARAM) of the thread calling CGEventPost. If your thread is blocked on a lower priority thread, the kernel should temporarily elevate the priority of the blocking thread to avoid priority inversion and help your task complete earlier.

Overall I think you will have to implement a 2-thread solution, as below:

Create a queue for events you want to post. Post events to this queue from your main thread (or event posting thread), then signal a second thread (an event consumer thread you create) to walk the queue and post any outstanding events with CGEventPost.

When CGEventPost blocks, your second event-posting thread will block, but this won't block any other threads. When CGEventPost eventually unblocks, it will consume any outstanding events posted by your event consumer thread and the event consumer thread can resume posting events.

Another possibility: can you coallesce events? There are certain types of events (mouse moves?) that you could coallesce into fewer events. You will probably still run into the queue limit of CGEventPost at times, I think the 2-thread approach is probably your best bet.

Since I asked the question we've switched from CGEventPost to IOHIDPostEvent which performs much better under GPU load, but still blocks in other situations. — Dae, May 13 '13 at 19:32
Using Instruments we found that for both IOHID and Quartz we are blocking in mach_msg_trap, probably waiting to receive a message. We haven't managed to get more details (i.e. who is at the other end, and why is he blocking). — Dae, May 13 '13 at 19:32
As for the real time priority, it was accidentally removed from the test app, but we do have it in our actual project. (Makes no difference however). We've implemented coalescing and threading - thanks for the suggestion! — Dae, May 13 '13 at 19:33

Weak performance of CGEventPost under GPU load

1 Answers1