This issue was a fun one, brought to me due to a single replica in an availability group constantly having an issue where it seemed that SQL Server just stopped processing certain requests. If you’re reading this and thinking to yourself, “Just one replica in the topology? What’s different with that one replica?” then give yourself a pat on the back, you’re the right type of person. If you read that and thought, “I wonder what bug was in SQL this time?” then please stop using technology.
The symptoms of this issue were interesting, every so often the instance would just kind of get “stuck” – at least that is how it was described to me. Nothing would work, cancelling queries, attempting to kill queries, submitting anything new, nothing seemed to really do anything except restarting the service. Once the service was restarted, the instance (and AG) would hum along nicely… until some random time later when submitting different queries would just grind to a halt. Fun.
Since it was a query that seemed to be just waiting forever, no matter what, it’d make sense to get the stack for that thread, which is listed below.
ntdll!ZwWaitForSingleObject ntdll!ldrpDrainWorkQueue ntdll!ldrpInitializeThread ntdll!_LdrpInitialize ntdll!LdrpInitializeInternal ntdll!LdrInitializeThunk
Sure looks a whole lot like not SQL Server related items, which is correct, it’s plumbing in Windows dealing with creation of a new thread. In this case it has to do with the loader lock, which had some enhancements done to it in Windows 10+ for parallel loading of libraries. It’s waiting for all outstanding work to be completed in the loader infrastructure before continuing, so what in this process is holding the loader lock (which is process wide) or otherwise not allowing the work to drain properly? Time for a memory dump.
Looking at the dump, there is another thread which is attempting to load a library.
ntdll!ZwWaitForSingleObject ntdll!ldrpDrainWorkQueue ntdll!LdrpLoadDllInternal ntdll!LdrLoadDLL KERNELBASE!LoadLibraryExW xpsqlbot!SqlGenericLoadLibrary xpsqlbot!SsLoadLibraryGeneric xpsqlbot!SsLoadLibrary xpsqlbot!xp_qv
SQL Server is attempting to load a library for use, sure, makes sense. Seems pretty normal… but why is this also hanging up on draining the queue? Even if the loader lock was somehow orphaned, the existing threads should continue to execute without an issue – assuming no loading of new libraries or resources. It doesn’t seem as though this is the actual problem, but another symptom of a different issue. Gotta keep looking.
Continuing through the dump, this magnificent beast of a stack showed up. I’ll let you try to figure out what it means.
win32u!ZwUserWaitMessage user32!DialogBox2 user32!InternalDialogBox user32!SoftModalMessageBox user32!MessageBoxWorker user32!MessageBoxTimeoutW user32!MessageBoxTimeoutA user32!MessageBoxExA user32!MessageBoxA
If you said, hey this shows a modal message box (gui) element, you’d be correct. Now, why would a headless service (SQL Server) which typically runs as a service attempt to show a message box (GUI element)? That seems like a huge bug, because a modal message box is going to stop everything and wait for a response on that thread from the user. If nothing can continue processing on that thread and it’s holding a lock, no one else is going to be able to get past it… and since there is no GUI for SQL Server, no one is going to ever be able to respond to the message box. Ok, so seems like we need to figure out who is trying to do something to stupid, so against everything you’d do for a headless service, that it’s freezing up the rest of SQL Server… which component is causing all these “bugs”. Let’s continue down the stack for this thread and see if we get any hints.
OraOLEDB19
That’s right. The Oracle OLEDB driver loaded in process is attempting to show a message box and completely hosing the instance of SQL Server. So much for this huge magical SQL Server bug…. the crappy Oracle drivers strike again!
One wonders how frequently that modal MessageBox has ever been visible to a person. Seems to me a *driver* should never be asking a question and waiting for user input.
100%. It’s not the first time the Oracle driver attempted to show a message box either :/ It really bothers me that the .net drivers will show authentication windows also.
Why does SQL Server allow the Oracle drivers to sabotage it? Sounds like a SQL Server Problemâ„¢ to me.
Yeah, I do wish it wasn’t loaded in-process :/