I've got a UWP (C#) app that's running in production on a remote machine (under windows 10) but it periodically crashes. My client says, somewhat arbitrarily, every 9 hours or so.
I have several .wer files from the previous crashes but did not have a minidump, the paths referenced in the event viewer entry for the crash are blank other than the WER files. See edits below for how a minidump was obtained and findings.
The exception is an access violation (0xc0000005) at exception offset 0x0004df23 in ntdll.dll
I have the full source for the application and can run it in debug for long periods without the crash.
If I use DLL Export Viewer and load the exact version of ntdll.dll (copied from the remote machine) then I can see that at relative address 0x0004dc60 is EtwNotificationRegister and at 0x0004e260 is LdrGetDllPath.
Does this mean that my crash is occurring within a line of code in EtwNotificationRegister (which in turn is invoked by something within our code; however very difficult to trace without stack/minidump) I am not sure if the layout of a dll is such that the address I have can be placed like that?
Edit 2 as per @Raymond: No. There are almost certainly other non-exported functions between EtwNotificationRegister and LdrGetDllPath. On build 17763.475, offset 4df23 is RtlpWaitOnCriticalSection, so you are probably using an uninitialized critical section or an already-deleted critical section.
Is there any way I can extract more detail about this crash? I have remote access to the computer running the app but the crash does not appear to be triggered by a particular event (e.g. we can't hit a button and cause the crash)
Using a minidump now
I am running the program in both local debug as well. I have a remote debugger to the remote process but can't seem to break or inspect threads, not sure why. Just redeployed with symbols and the debugger attaches no problem but it just skips all breakpoints :(
Our own (rather naive) local log file, originally intended for only local debugging is written with a StreamWriter.WriteLine
and immediately followed with a StreamWriter.Flush
(wrapped in a try catch since that's not thread safe) just ends at a normal event on the remote machine - there is nothing following this normal event.
We catch App_UnhandledException and write to this log so I'd have expected a stack here.
In Unexplained crashes related to ntdll.dll it is suggested that a crash from ntdll.dll is a canary in a coalmine Unexplained crashes related to ntdll.dll
Edit 1: I have configured an auto crash dump as per https://www.meziantou.net/2018/06/04/tip-automatically-create-a-crash-dump-file-on-error so if I can get it to crash again maybe I'll get a dump file next time?
Here is the detail from the WER
Version=1
EventType=MoAppCrash
EventTime=132017523132123596
ReportType=2
Consent=1
UploadTime=132017523137590717
ReportStatus=268435456
ReportIdentifier=8d467f04-4bdd-4f9e-bf26-b42d143ece1a
IntegratorReportIdentifier=b60f9ca0-4126-4262-a886-98d3844892d3
Wow64Host=34404
NsAppName=praid:App
OriginalFilename=XXXXXX.YYYYYY.exe
AppSessionGuid=00001514-0001-0004-9fe2-6df11905d501
TargetAppId=U:XXXXXX.YYYYYY_1.0.201.0_x64__b0abmt6f49vqj!App
TargetAppVer=1.0.201.0_x64_!2018//01//24:08:17:16!1194d!XXXXXX.YYYYYY.exe
BootId=4294967295
TargetAsId=1298
UserImpactVector=271582000
IsFatal=1
EtwNonCollectReason=4
Response.BucketId=2ee79f27e2e81a541d6200d746866340
Response.BucketTable=5
Response.LegacyBucketId=2117255699418735424
Response.type=4
Sig[0].Name=Package Full Name
Sig[0].Value=XXXXXX.YYYYYY_1.0.201.0_x64__b0abmt6f49vqj
Sig[1].Name=Application Name
Sig[1].Value=praid:App
Sig[2].Name=Application Version
Sig[2].Value=1.0.0.0
Sig[3].Name=Application Timestamp
Sig[3].Value=5a68410c
Sig[4].Name=Fault Module Name
Sig[4].Value=ntdll.dll
Sig[5].Name=Fault Module Version
Sig[5].Value=10.0.17763.475
Sig[6].Name=Fault Module Timestamp
Sig[6].Value=3230aa04
Sig[7].Name=Exception Code
Sig[7].Value=c0000005
Sig[8].Name=Exception Offset
Sig[8].Value=000000000004df23
DynamicSig[1].Name=OS Version
DynamicSig[1].Value=10.0.17763.2.0.0.256.48
DynamicSig[2].Name=Locale ID
DynamicSig[2].Value=5129
DynamicSig[22].Name=Additional Information 1
DynamicSig[22].Value=95b1
DynamicSig[23].Name=Additional Information 2
DynamicSig[23].Value=95b15a88b673e33a5f48839974790b1c
DynamicSig[24].Name=Additional Information 3
DynamicSig[24].Value=283d
DynamicSig[25].Name=Additional Information 4
DynamicSig[25].Value=283dea7b6b6112710c1e3f76ed84d993
Edit 3: screenshot of minidump from a crash last night. In the event log, the WER crash looks the same so this appears to be the same issue. I will see if I can load symbols etc.
Edit 4: Attempting to debug managed. Threads view shows a thread as the exception point but no call stack info.
Edit 5: Debugging native from the minidump. Looks like we have a winner. @Raymond was correct, it was RtlpWaitOnCriticalSection invoked from BluetoothLEAdvertismentWatcher::AdvertismentReceivedCallbackWorker
Native call stack as text:
Not Flagged > 8748 0 Worker Thread Win64 Thread Windows.Devices.Bluetooth.dll!(void) ntdll.dll!RtlpWaitOnCriticalSection() ntdll.dll!RtlpEnterCriticalSectionContended() ntdll.dll!RtlEnterCriticalSection() Windows.Devices.Bluetooth.dll!(void)() Windows.Devices.Bluetooth.dll!wil::ResultFromException<(void)
() Windows.Devices.Bluetooth.dll!Windows::Devices::Bluetooth::Advertisement::BluetoothLEAdvertisementWatcher::AdvertisementReceivedCallbackWorker(void) Windows.Devices.Bluetooth.dll!Windows::Devices::Bluetooth::Advertisement::BluetoothLEAdvertisementWatcher::AdvertisementReceivedThreadpoolWorkCallbackStatic(struct _TP_CALLBACK_INSTANCE *,void *,struct _TP_WORK *) ntdll.dll!TppWorkpExecuteCallback() ntdll.dll!TppWorkerThread() kernel32.dll!BaseThreadInitThunk() ntdll.dll!RtlUserThreadStart()
Edit 6: okay so now, what do I do? How can I resolve this problem? My understanding of the stack is it looks like an exception was thrown inside the callback? Is that correct? So I could put a managed try/catch in the BLE advertisment callback handler and that should (catch - for further debugging) fix it?
Edit 7: code... Here is the code we use to instantiate the wrapper and subscribe to events. The BluetoothLEAdvertisementWatcherWrapper is a delgating class (e.g. it just wraps the underlying BluetoothLEAdvertisementWatcher so it can implement an interface; it simply passes all events through and exposes properties. We do this so that we can have a different version that creates virtual events for testing)
bluetoothAdvertisementWatcher = new BluetoothLEAdvertisementWatcherWrapper();
bluetoothAdvertisementWatcher.SignalStrengthFilter.SamplingInterval = TimeSpan.Zero;
bluetoothAdvertisementWatcher.ScanningMode = BluetoothLEScanningMode.Active;
bluetoothAdvertisementWatcher.Received += Watcher_Received;
bluetoothAdvertisementWatcher.Stopped += Watcher_Stopped;
bluetoothAdvertisementWatcher.Start();
Here is the code for the wrapper; just to show it's not doing anything complex:
public class BluetoothLEAdvertisementWatcherWrapper : IBluetoothAdvertismentWatcher, IDisposable
{
private BluetoothLEAdvertisementWatcher bluetoothWatcher;
public BluetoothLEAdvertisementWatcherWrapper()
{
bluetoothWatcher = new BluetoothLEAdvertisementWatcher();
}
public BluetoothSignalStrengthFilter SignalStrengthFilter => bluetoothWatcher.SignalStrengthFilter;
public BluetoothLEScanningMode ScanningMode
{
get
{
return bluetoothWatcher.ScanningMode;
}
set
{
bluetoothWatcher.ScanningMode = value;
}
}
public event TypedEventHandler<BluetoothLEAdvertisementWatcher, BluetoothLEAdvertisementReceivedEventArgs> Received
{
add
{
bluetoothWatcher.Received += value;
}
remove
{
bluetoothWatcher.Received -= value;
}
}
public event TypedEventHandler<BluetoothLEAdvertisementWatcher, BluetoothLEAdvertisementWatcherStoppedEventArgs> Stopped
{
add
{
bluetoothWatcher.Stopped += value;
}
remove
{
bluetoothWatcher.Stopped -= value;
}
}
public BluetoothLEAdvertisementWatcherStatus Status => bluetoothWatcher.Status;
public Action<IPacketFrame, short> YieldAdvertisingPacket { get => throw new NotImplementedException(); set => throw new NotImplementedException(); }
public void Start()
{
bluetoothWatcher.Start();
}
public void Stop()
{
bluetoothWatcher.Stop();
}
public void Dispose()
{
if (bluetoothWatcher != null)
{
if (bluetoothWatcher.Status == BluetoothLEAdvertisementWatcherStatus.Started)
{
bluetoothWatcher.Stop();
}
bluetoothWatcher = null;
}
}
}
And here is the code for the Watcher_Received event handler:
private void Watcher_Received(BluetoothLEAdvertisementWatcher sender, BluetoothLEAdvertisementReceivedEventArgs args)
{
try
{
//we won't queue packets until registered
if (!ApplicationContext.Current.Details.ReceiverId.HasValue)
return;
IPacketFrame frame;
PacketFrameParseResult result = ParseFrame(args, out frame);
if (result == PacketFrameParseResult.Success)
{
ApplicationContext.Current.Details.BluetoothPacketCount++;
}
short rssi = args.RawSignalStrengthInDBm;
string message = FormatPacketForDisplay(args, args.AdvertisementType, rssi, frame, result);
if (BluetoothPacketReceived != null)
{
BluetoothPacketReceived.Invoke(this, new BluetoothPacketReceivedEventArgs(message, result, frame, rssi));
}
}
catch (Exception ex)
{
if (ex.InnerException is Exceptions.PacketFrameParseException && (ex.InnerException as Exceptions.PacketFrameParseException).Result == PacketFrameParseResult.InvalidData)
{
// noop
}
else
{
Logger.Log(LogLevel.Warning, "BLE listener caught bluetooth packet error: {0}", ex);
if (BluetoothPacketError != null)
{
BluetoothPacketError.Invoke(this, new BluetoothPacketErrorEventArgs(ex));
}
}
}
}
You can see here that the entire managed callback is wrapped in a try catch and doesn't rethrow, so I'm not sure if there's anything further I can do to prevent the native exception from bringing the application down.
Current thinking, based on this: RtlpEnterCriticalSectionContended
is it a parallel event handler, the native side is raising the handler, and it raises for a new event in the same thread while the previous handler is still executing from a previous event?
Then this is a race condition on the critical section that causes the crash?
Edit 8: To test this theory, I replaced the contents of received with a read + push to a concurrent queue, allowing the managed code to exit the event handler as quickly as possible. Then added a seperate thread reading from the concurrent queue to perform my application side processing. Initially, I thought this had resolved the issue as the application actively (listening) ran for approximately 15 hours, however it crashed again this morning with the same symptoms.
Edit 8: Following suggestions in the comments, we tried to ensure that we didn't dispose/GC the watcher after a stop prior to the receive completing. We did this by using a TaskCompletionSource to function as a promise, subscribing to the Stopped event so we could await on the completion source task which would only have a result set when the Stopped event had fired.
We also used a lock (Monitor.Enter) in both StopAsync and Received to ensure that both could not be running in parallel. This appeared to reduce the speed at which the system could process events which would make sense if the BLE packets were arriving in parallel. Updated code as follows:
if ((DateTime.Now - this.LastStartedTimestamp).TotalSeconds > 60)
{
if (this.LastStopReason != BluetoothWatcherStopReason.DeviceCharacteristicWorker)
{
Logger.Log(LogLevel.Debug, "Stopping bluetooth watcher...");
// restart watcher every 10 mins
await this.StopAsync(BluetoothWatcherStopReason.AutomaticRestart);
//start again if automatic restart
Logger.Log(LogLevel.Debug, "Starting bluetooth watcher...");
this.Start(this.testMode);
Logger.Log(LogLevel.Debug, "Started bluetooth watcher");
this.LastStartedTimestamp = DateTime.Now;
}
}
private void Watcher_Stopped(BluetoothLEAdvertisementWatcher sender, BluetoothLEAdvertisementWatcherStoppedEventArgs args)
{
string error = args.Error.ToString();
Logger.Log(LogLevel.Warning, string.Format("BLE listening stopped because {0}...", error));
LastError = args.Error;
if (BluetoothWatcherStopped != null)
{
BluetoothWatcherStopped.Invoke(sender, args);
}
}
public class ReceivedBluetoothAdvertismentPacketItem
{
public DateTime Timestamp { get; set; }
public BluetoothLEAdvertisementType Type { get; set; }
public byte[] Buffer { get; set; }
public short Rssi { get; set; }
}
ConcurrentQueue<ReceivedBluetoothAdvertismentPacketItem> BluetoothPacketsReceivedQueue = new ConcurrentQueue<ReceivedBluetoothAdvertismentPacketItem>();
private void Watcher_Received(BluetoothLEAdvertisementWatcher sender, BluetoothLEAdvertisementReceivedEventArgs args)
{
bool lockWasTaken = false;
try
{
//this prevents stop until we're exiting Received
Monitor.Enter(BluetoothWatcherEventSynchronisation, ref lockWasTaken);
if (!lockWasTaken)
{
return;
}
//we won't queue packets until registered
if (!ApplicationContext.Current.ReceiverDetails.ReceiverId.HasValue)
return;
BluetoothLEAdvertisementType type = args.AdvertisementType;
byte[] buffer = GetManufacturerData(args.Advertisement);
short rssi = args.RawSignalStrengthInDBm;
BluetoothPacketsReceivedQueue.Enqueue(new ReceivedBluetoothAdvertismentPacketItem
{
Timestamp = DateTime.UtcNow,
Type = type,
Rssi = rssi,
Buffer = buffer
});
ApplicationContext.Current.ReceiverDetails.UnprocessedQueueLength = BluetoothPacketsReceivedQueue.Count;
}
catch (Exception ex)
{
Logger.Log(LogLevel.Warning, "BLE listener caught bluetooth packet error: {0}", ex);
if (BluetoothPacketError != null)
{
BluetoothPacketError.Invoke(this, new BluetoothPacketErrorEventArgs(ex));
}
}
finally
{
if (lockWasTaken)
{
Monitor.Exit(BluetoothWatcherEventSynchronisation);
}
}
}
public BluetoothWatcherStopReason LastStopReason { get; private set; } = BluetoothWatcherStopReason.Unknown;
private object BluetoothWatcherEventSynchronisation = new object();
public Task<BluetoothWatcherStopReason> StopAsync(BluetoothWatcherStopReason reason)
{
var promise = new TaskCompletionSource<BluetoothWatcherStopReason>();
if (bluetoothAdvertisementWatcher != null)
{
LastStopReason = reason;
UpdateBluetoothStatusInReceiverModel(BluetoothLEAdvertisementWatcherStatus.Stopped); //actually stopping but we lie
bool lockWasTaken = false;
try
{
Monitor.Enter(BluetoothWatcherEventSynchronisation, ref lockWasTaken);
{
bluetoothAdvertisementWatcher.Received -= Watcher_Received;
bluetoothAdvertisementWatcher.Stopped += (sender, args) =>
{
// clean up
if (bluetoothAdvertisementWatcher != null)
{
bluetoothAdvertisementWatcher.Stopped -= Watcher_Stopped;
bluetoothAdvertisementWatcher = null;
}
//notify continuation
promise.SetResult(reason);
};
bluetoothAdvertisementWatcher.Stop();
}
}
finally
{
if (lockWasTaken)
{
Monitor.Exit(BluetoothWatcherEventSynchronisation);
}
}
}
base.Stop();
return promise.Task;
}
Following these changes, the same crash is still occuring in the Windows.Devices.Bluetooth native assembly (as per above)
Edit 9: I've removed the automatic periodic start/stop and now the app has been stable for > 36 hours without a crash. So something inside this flow is causing the crashes. We originally added that to work around an issue with the advertisment watcher just stopping after a while, so we'd like to keep it if we can fix it.
The if statement if ((DateTime.Now - this.LastStartedTimestamp).TotalSeconds > 60)
(and block) is currently commented.
I have opened a bug for windows universal here: https://wpdev.uservoice.com/forums/110705-universal-windows-platform/suggestions/37623343-bluetoothleadvertismentwatcher-advertismentreceiv