
I want to capture a stack trace of an application that sometimes stops responding for a few minutes.

When the application stops responding, the Windows desktop also stops responding to mouse clicks, although some other applications that were already running keep working (for example, WinDbg works fine, and Process Explorer refreshes its screen but does not respond to mouse events). While the application is unresponsive, it uses about 80% of one CPU core. That is why I would like to get a stack trace.

The misbehaving application usually takes about 2-3 minutes to do its strange job. If Ctrl+Esc is pressed, it becomes responsive immediately (and the Start menu opens, of course).

I have WinDbg attached to the misbehaving application and when I issue the Break command, the break-in does not happen until the application starts to respond again.

From what I understand the break-in actually creates a remote thread which pretty soon calls DbgBreakPoint.

What could be preventing the debugger's thread from executing?

EDIT: First of all thanks for your help!

I was also thinking that this might be caused by a bad device driver or something that installs a system wide hook somewhere.

I was thinking of enabling kernel debugging to get a stack trace from the kernel for the offending thread, or enabling the manual bluescreen trigger to produce a dump and look at that afterwards.

Process Explorer and Process Monitor do not reveal anything interesting. They also become unusable when the bug is triggered (their windows keep updating but do not respond to mouse or keyboard input).

EDIT2: Background info: the app uses Qt, OpenGL and DirectSound and runs on Windows 7 SP1 x64. I currently suspect something in the graphics part.

The strange thing is that if a system-wide lock were taken (like the GDI lock), this would prevent other windows from being drawn, but that does not happen. WinDbg on the same machine works fine. Process Explorer updates but does not receive mouse clicks, and the desktop updates but does not receive mouse clicks either.

I currently have a kernel debugger attached...

EDIT3: ETW was the most useful tool for debugging this. It turns out that Qt's main event processing loop goes crazy. PeekMessage and MsgWaitForMultipleObjectsEx (with a 0 timeout) get called in a tight loop. That is where the high CPU usage comes from. It looks like the app is generating or receiving loads of messages at that time, but it is not easy to see what the messages are (or I don't know how to access function parameters in ETW). Using a debugger does not help much either, but a breakpoint in Qt's event loop leads me to believe that WM_TIMER messages are the culprit.
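
To make the symptom concrete, here is a toy model of a message pump in Python. It is only a sketch under assumptions: `run_loop` and `rearm_timer` are hypothetical names, the queue stands in for the thread's message queue, and "a timer that fires again immediately" is just one plausible way a WM_TIMER flood could keep the loop from ever blocking. It is not Qt's actual event loop.

```python
from collections import deque

WM_PAINT = 0x000F  # Win32 message IDs (values from winuser.h)
WM_TIMER = 0x0113

def run_loop(initial_messages, rearm_timer, max_iterations=1000):
    """Toy message pump. Returns (iterations, would_block).

    would_block is True when the queue drained, i.e. a real loop would
    now sleep in its wait call instead of spinning.
    """
    queue = deque(initial_messages)
    iterations = 0
    while queue and iterations < max_iterations:
        msg = queue.popleft()  # stands in for PeekMessage(..., PM_REMOVE)
        iterations += 1
        if msg == WM_TIMER and rearm_timer:
            # Assumption: the timer is due again at once, so a new
            # WM_TIMER is pending before the loop gets a chance to wait.
            queue.append(WM_TIMER)
    return iterations, not queue

healthy = run_loop([WM_PAINT, WM_TIMER], rearm_timer=False)
runaway = run_loop([WM_PAINT, WM_TIMER], rearm_timer=True)
print(healthy, runaway)
```

In the healthy case the queue drains after two messages and the loop could block. In the runaway case the queue never empties, so the loop only stops because of the iteration cap, which matches the tight PeekMessage loop seen in the trace.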

Jaka
  • What is the Windows version? Does this happen only on this version or on all? – RbMm Feb 01 '17 at 08:59
  • Windows 7 SP1 x64. I did not test on other versions since some components are certified only on Windows 7. The problem is also not readily reproducible. It may happen after 2 hours of usage (which is consistent with a dodgy driver theory). – Jaka Feb 01 '17 at 10:23
  • have you tried WPA/ETW to see where you got the hang? – magicandre1981 Feb 04 '17 at 07:25
  • @magicandre1981 Yes, see the updated question. – Jaka Feb 05 '17 at 21:21
  • You can't see function parameters in ETW. You need to write an ETW provider and raise events in the Qt loop that include the function parameters in the event payload data (so this requires building Qt on your own), and you also need to capture the events from this user-mode provider like I show here for .NET events: http://stackoverflow.com/a/30289933/1466046 – magicandre1981 Feb 06 '17 at 16:09
  • That is too much hassle :) I'll make a tool to hook PeekMessageW and inspect the MSG structure just before it returns. Or maybe I'll use https://www.frida.re/ instead... – Jaka Feb 06 '17 at 22:05
  • Do you see which messages Qt sends too heavily? – magicandre1981 Feb 10 '17 at 16:25
  • It was a runaway timer. – Jaka Feb 27 '17 at 08:03
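
Once a hook on PeekMessageW (or a Frida script) is logging the MSG fields, finding the flooding message is mostly a counting exercise. The following is a minimal sketch in Python, assuming the hook has already produced a list of `(hwnd, message_id)` pairs. The `summarize` helper and the sample records are hypothetical and made up for illustration; only the message ID values come from winuser.h.

```python
from collections import Counter

# A few well-known Win32 message IDs (values from winuser.h).
MESSAGE_NAMES = {
    0x000F: "WM_PAINT",
    0x0113: "WM_TIMER",
    0x0200: "WM_MOUSEMOVE",
}

def summarize(records, top=3):
    """Count messages by ID and return the most frequent ones.

    records: iterable of (hwnd, message_id) pairs, assumed to come from
    a hypothetical hook that logs each MSG returned by PeekMessageW.
    """
    counts = Counter(message_id for _hwnd, message_id in records)
    return [
        (MESSAGE_NAMES.get(message_id, hex(message_id)), n)
        for message_id, n in counts.most_common(top)
    ]

# Made-up sample data, not taken from the real application.
sample = (
    [(0x1A2B, 0x0113)] * 950
    + [(0x1A2B, 0x000F)] * 30
    + [(0x3C4D, 0x0200)] * 20
)
report = summarize(sample)
print(report)
```

With data shaped like this, a runaway timer would show up as WM_TIMER dominating the counts. Grouping by `hwnd` as well (or by the timer ID in `wParam`, if the hook records it) would narrow down which timer is responsible.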

2 Answers


Given that the desktop also misbehaves during this time, it sounds like your app isn't necessarily misbehaving but merely aggravating a bug elsewhere (e.g., in a device driver or some crummy anti-malware code that has injected itself into other processes). Stack traces from your app may or may not be very revealing.

If the problem is easily reproducible, I'd set a breakpoint somewhere in the "middle" of the app and see if the problem happens before or after that. Then move the breakpoint until you find the last instruction your app executes before things go bonkers. Figuring out what your app does that triggers this behavior may give a clue as to what's going on.

Another option is to try using some system-wide debugging tools. First, I'd peek in the Event Viewer to see if there are suspicious error or warning events posted around the moment the machine goes haywire. Then I'd try a tool like Sysinternals' Process Monitor or Process Explorer to get a better view of what's happening. You might also try ETW to capture a system-wide trace that you can study after the fact. (ETW can be hard to use, so check out Bruce Dawson's UIforETW.)

Adrian McCarthy
  • Thanks for reminding me of ETW. That will be easier to manage than kernel debugger :) (No firewire adapters etc....) – Jaka Feb 01 '17 at 08:58
  • Also, if we forget about this specific problem, I would be interested to know what kind of system-wide lock (or something similar) would prevent the remote thread that the debugger creates from running. Any ideas? – Jaka Feb 01 '17 at 09:00
  • @Jaka - this may be related to some bug in the win32k.sys subsystem. Say a thread acquires some system-wide resource without disabling APCs, and the thread that acquired this resource is suspended. As a result, all GUI threads can hang when they try to acquire this resource too. Many years ago I saw something like this on XP: if suspend gui thread, when it in focus receive process – RbMm Feb 01 '17 at 09:11

Use ETW to find the cause. Install the Windows Performance Toolkit (part of the Win10 v1511 SDK: https://go.microsoft.com/fwlink/p/?LinkID=698771, which is the last version that works on Win7), run WPRUI.exe, select CPU Usage, and click Start.

After you have captured the hang, click Save. Wait until WPRUI has finished, open the ETL file in WPA, then set up and load debug symbols in WPA.

Drag and drop the CPU Usage (Precise) graph onto the analysis pane, look for the WAIT (µs) max for your process to find the long hang, and expand the stack to see where it happens.

magicandre1981