Issue Introduction
I was sent a quite an interesting email stating that SQL Server would not shut down. Attempting to stop the service via services or the SQL Server Configuration Manager resulted in a timeout with SQL Server still running. Trying to execute the shutdown
with and without NOWAIT
T-SQL command resulted in the same, the process still running. Seems quite weird that SQL Server just refuses to shutdown!
Questions That Need Answered
1. Why isn’t SQL Server shutting down properly?
2. Is there anything that can be done to fix it?
Initial Data Collection
Since SQL Server isn’t shutting down, getting a snapshot of what’s going on is a good first step. Using SQLDUMPR is out of the question here, the trusty procdump sysinternals tool works well here. Since a memory dump is a point in time snapshot of the process, it’ll first need to be in the problematic state before a dump will be useful. There are different scenarios when you’d want to get multiple dumps with some time between each, thus is not one of those scenarios though.
Checking The Collected Data
When looking at dumps like this, where there isn’t a specific error, you really have to just take a 10,000 foot view of what the process is doing. This is true not just for SQL Server but for any other process, no matter what operating system it is running on. In this specific situation, there wasn’t anything extremely remarkable… when you look at enough SQL Server dumps you start to get a feel for what looks like it might be an issue and what just seems to be normal. Going through the stacks on the first pass, there was only one thread I noted as being interesting (below) and I think you’ll agree!
00 ntdll!NtSignalAndWaitForSingleObject 01 KERNELBASE!SignalObjectAndWait 02 sqldk!SOS_Scheduler::SwitchToThreadWorker 03 sqldk!SOS_Scheduler::Switch 04 sqldk!SOS_Scheduler::SuspendNonPreemptive 05 sqldk!WaitableBase::Wait 06 sqllang!CServiceBrokerMgr::NotifyInstanceShutdown 07 sqlmin!DBMgr::ShutdownAll 08 sqlservr!SignalProc 09 sqlmin!StartUp::InitDBMS 0a sqldk!SOS_Task::Param::Execute 0b sqldk!SOS_Scheduler::RunTask 0c sqldk!SOS_Scheduler::ProcessTasks 0d sqldk!SchedulerManager::WorkerEntryPoint 0e sqldk!SystemThread::RunWorker 0f sqldk!SystemThreadDispatcher::ProcessWorker 10 sqldk!SchedulerManager::ThreadEntryPoint 11 kernel32!BaseThreadInitThunk 12 ntdll!RtlUserThreadStart
If you look at the stack, frame 7 is calling into DBMgr::ShutdownAll
which seems to be a good place to start if you’re going to shut the instance down. Continuing to go further up the stack (toward the lower numbers, the way the stack works is the frame at the top is what is running and the frames below it are what called into that frame [not always true, for example jmp
commands]) the next frame, 6, is on sqllang!CServiceBrokerMgr::NotifyInstanceShutdown
and that really seems to look like the instance is waiting on service broker. If we continue to move up the stack, frame 5 transitions to a wait from which we then further transition up the stack going through the typical SQL OS implementation of thread scheduling and finally to a WaitForSingleObject WINAPI call. This means the thread is suspended (according to both SQL and Windows) and it will be woken up and scheduled when the waitable object is signaled (which is different depending on the object, for example on a timer will occur when the timer expires whereas waiting on a thread will signal when the thread exits).
The seemingly shutdown thread is waiting on the service broker manager… ok… so that points us in the direction of service broker. Let’s check the dump again, looking for anything dealing with service broker.. and we do find one thread that seems to be running and waiting for the next message.
00 ntdll!NtSignalAndWaitForSingleObject 01 KERNELBASE!SignalObjectAndWait 02 sqldk!SOS_Scheduler::SwitchToThreadWorker 03 sqldk!SOS_Scheduler::Switch 04 sqldk!SOS_Scheduler::SuspendNonPreemptive 05 sqldk!WaitableBase::Wait 06 sqllang!CBrokerWaitFor::ReceiveWaitFor 07 sqllang!CBrokerQueryNotification::Wait 08 sqllang!CXStmtQuery::WaitForResults 09 sqllang!CXStmtDML::XretDMLExecute 0a sqllang!CXStmtReceive::ExecuteWaitFor 0b sqllang!CXStmtReceive::XretExecute 0c sqllang!CMsqlExecContext::ExecuteStmts<1,0> 0d sqllang!CMsqlExecContext::FExecute 0e sqllang!CSQLSource::Execute 0f sqllang!CSbActivationProcTask::Execute 10 sqllang!CSbTask::TaskStart 11 sqllang!CSbTask::BeginTaskStart 12 sqldk!SOS_Task::Param::Execute 13 sqldk!SOS_Scheduler::RunTask 14 sqldk!SOS_Scheduler::ProcessTasks 15 sqldk!SchedulerManager::WorkerEntryPoint 16 sqldk!SystemThread::RunWorker 17 sqldk!SystemThreadDispatcher::ProcessWorker 18 sqldk!SchedulerManager::ThreadEntryPoint 19 kernel32!BaseThreadInitThunk 1a ntdll!RtlUserThreadStart
Notice in frame 0x11 (note that all frames are in hex) we can see a service broker task that was created and eventually picked up and executed… but from where? Moving up the stack, on frame 0xf, this gives us a sqllang!CSbActivationProcTask::Execute
which means it’s in an activated stored procedure. Alright, so we’re yet again given a clue but no actual answer. Some of you may be thinking, “But Sean, look at frame 0xe! Why can’t you just look at that and figure out the T-SQL that was running?!” and you’d be right in thinking that. However, to view the local variables one would need to have the private symbols of which I do not have as a member of the general public which is why you and I can’t just go to frame 0xe and checking the locals. You might then say, “Ok, sure, but you could do so many more things to figure out what this might be!” and sure, but that wouldn’t be a very good use of your time to be honest.
We have some information that an activated stored procedure is running, now what? I gave this information back to the person and asked what service broker activated stored procedures on queues they were using and it turns out there was just a single one… for login auditing. Another step forward, wooo! Let’s take another look at the dump and see if we see any login items.
00 ntdll!NtSignalAndWaitForSingleObject 01 KERNELBASE!SignalObjectAndWait 02 sqldk!SOS_Scheduler::SwitchToThreadWorker 03 sqldk!SOS_Scheduler::Switch 04 sqldk!SOS_Scheduler::SuspendNonPreemptive 05 sqllang!SOS_Task::Sleep 06 sqllang!CTraceEvDataQueue::Enqueue 07 sqllang!CTraceEv::Fire 08 sqllang!CTraceController::ProduceRecord 09 sqllang!CSecSpecialAudit::PopulateTraceForLoginFailed 0a sqllang!CSecSpecialAudit::AuditEvent 0b sqllang!AuditCheckSpecialAction 0c sqllang!ReportLoginFailure 0d sqllang!LoginFailedToReasonPhraseMap 0e sqllang!login 0f sqllang!process_login_finish 10 sqllang!process_login 11 sqllang!process_commands_internal 12 sqllang!process_messages 13 sqldk!SOS_Task::Param::Execute 14 sqldk!SOS_Scheduler::RunTask 15 sqldk!SOS_Scheduler::ProcessTasks 16 sqldk!SchedulerManager::WorkerEntryPoint 17 sqldk!SystemThread::RunWorker 18 sqldk!SystemThreadDispatcher::ProcessWorker 19 sqldk!SchedulerManager::ThreadEntryPoint 1a kernel32!BaseThreadInitThunk 1b ntdll!RtlUserThreadStart
There were 6 threads all with failed login processing, which seems like quite a few. If these are failed logins, then there should be an entry in the errorlog for said items. I asked the person to check the errorlog and their response was there are a large number of login failed errors, specifically stating due to the instance shutting down. Here, a large number is a very large number, hundreds per second.
Possible Mitigations and Results
We know it’s service broker causing a problem, specifically with the auditing of the login failures. I presented the following possible mitigations:
1. Check and add, if it didn’t exist, a TIMEOUT value as part of the WAITFOR statement.
2. Disable the queue before shutting SQL down.
3. Change the applications to have a backoff timer for connection attempts (don’t DOS yourself).
The response I received (ha!) was that there wasn’t a timeout but adding a timeout still caused the problem. It wasn’t feasible to change the applications to be written to do backoff, due to the fact that they were setup in a quasi micro-service setup. Disabling the queue did in fact work for them, and that’s how they decided to proceed. Seems a bit heavy handed, but service broker pretty much hasn’t been touched since it was created. It’s a very interesting concept and item to put into a database system, but not really what database systems were made to really do.
Bringing It All Together and Answering Questions
We kind of know what is going on now, and to be quite honest this looks like a possible missed edge case which should be fixed in SQL Server. While a workaround does work, this wouldn’t go well when it comes to things such as patching, automatic failover, etc., I’m just not a fan as shutting down the instance should shut down the instance. Since service broker hasn’t been touched in over a decade and the fact is that not many people use it, I wouldn’t hold my breath for any updates or fixes in this area.
Let’s answer our questions from before:
1. Service broker was firing in response to a large number of failed logins which was stopping the instance from shutting down, which was due to the auditing being done through service broker and the fact that logins will fail when the instance is shutting down coupled with the fact that there were hundreds of failures per second.
2. There’s nothing that can really be done to fix it, this would most likely take a update to SQL Server’s code to truly fix.
Closing Thoughts
Getting a dump isn’t always the thing to do, but in this case it really helped lead to the root cause. Sometimes the main issue doesn’t really show itself and you need to keep following the clues… there seems to be some random thought by everyone where the problem and solution is always very straight forward and it’s only one simple thing and were on the nose. That just doesn’t really happen in most cases. It’s perfectly fine to keep going over the data and following the leads, in fact you must do this for any sufficiently hard problem. Troubleshooting is quickly becoming a lost skill/art in the age of cloud computing, which seems to push the blame on whomever provides the service, whether or not that service is the problem. If you want to spend less time going back and forth with random support companies, dig in a little and do some basics (note that this issue and troubleshooting is _not_ the basics).
Anyway, this was a very interesting issue and I hope you all enjoyed it as well!