What’s an IOCP?
IO Completion Ports are a set of Windows APIs which allow for efficient, fast, multithreaded asynchronous IO. Great, that pretty much tells you nothing.
SQL Server uses IO Completion Ports not for disk-based IO but for general network IO when it comes into SQL Server for TDS level items. This means it’s used for things such as connecting to an instance of SQL Server, sending batch and rpc information, etc., and is used to properly take actions on the incoming items. These actions should be extremely short and quick, the name of the game is low latency and high throughput which means not doing things like reading or writing from disk, allocating memory, calling functions that may block, etc., to keep things flowing.
What does Non-Yielding Mean?
If you read the post about Non-yielding Scheduler then you’ll be familiar with the concept. One of the main differences is that the IOCP thread non-yield time is 15 seconds, which is an eternity to wait when it comes to the low overhead fast throughput of the IOPC design. Much like with the scheduler, there is a background task that checks forward progress for these and will cause a dump if it is non-yielding.
The causes are also much the same, so while it could be an issue within SQL Server it can also be due to configuration and system setup, system performance (or lack thereof), 3rd party software such as the ever so very intrusive host protection/antivirus software, among a litany of other items.
Investigating The Dump
This is going to sound like a broken record, however, investigating the dumps are the same as the non-yielding scheduler. The threads running the IOCP are still considered SOS threads, so they will start the same as the others. The key piece of information here is there should be 1 IOCP thread per numa node and the function to look for is: sqldk!SOS_Node::ListenOnIOCompletionPort
Finding them is the same, you know the function, find it in the dump:
0:025> !findstack sqldk!SOS_Node::ListenOnIOCompletionPort Thread 016, 1 frame(s) match * 02 0000011cbaa56160 00007ff9291e64dd sqldk!SOS_Node::ListenOnIOCompletionPort Thread 035, 1 frame(s) match * 11 0000011cbaac8160 00007ff9291e64dd sqldk!SOS_Node::ListenOnIOCompletionPort Thread 052, 1 frame(s) match * 02 0000011cbab2e160 00007ff9291e64dd sqldk!SOS_Node::ListenOnIOCompletionPort Thread 059, 1 frame(s) match * 02 0000012972026160 00007ff9291e64dd sqldk!SOS_Node::ListenOnIOCompletionPort
The errorlog is once again your friend and may have some very valuable data, though you may also find nothing. Now, you could proceed without it, however it cuts down the time needed to investigate along with understanding which thread caused the issue. This becomes important later. One thing to point out is that in the errorlog, many of the values are in hex format, while in the debugger (at least windbg) it likes to use decimal, so for example thread 035, which is the debugger thread id, might be OS thread id 0x2124.
IO Completion Listener (0x2124) Worker 0x00000000066BC170 appears to be non-yielding on Node 3. Approx CPU Used: kernel 0 ms, user 15000 ms, Interval: 15002.
Here was can switch to the thread having a problem: ~~[0x2124]s;kc
00 ntdll!ZwWaitForSingleObject 01 KERNELBASE!WaitForSingleObjectEx 02 sqldk!SystemThread::Suspend 03 sqldk!SOS_Scheduler::SuspendPreemptive 04 sqldk!SOS_Scheduler::Suspend 05 sqldk!WaitableBase::Wait 06 sqldk!SOS_RWLock::GetLongWait 07 sqllang!SOS_RWLock::GetLock 08 sqllang!UcsReentrantAutoRWLock::GetLock 09 sqllang!UcsReentrantAutoRWLock::Acquire 0a sqllang!CSbTransmitter::TransportStreamUpdate 0b sqllang!CConnectionEndpoint::Close 0c sqllang!CConnectionEndpoint::ReceiveTask_IdempotentInvoke 0d sqllang!CConnectionEndpoint::SniReadHandler 0e sqllang!SNIReadDone 0f sqldk!SOS_IOCompRequest::ExecuteCompRoutine 10 sqldk!SOS_Node::ListenOnIOCompletionPort 11 sqldk!SOS_Task::Param::Execute 12 sqldk!SOS_Scheduler::RunTask 13 sqldk!SOS_Scheduler::ProcessTasks
You can see in the above stack that the completion routine has been stalled due to waiting on a lock, which for this one is a specific lock with the service broker/ucs layer. The question is we see it spent 15 seconds all in user-mode time, but was it the same stack as before? We can use the same commands from the scheduler issue to check it out: .cxr sqlmin!g_copiedStackInfo+0x20; kc
The takeaway is to help you do some initial triaging of the problem. Sometimes you’ll get callstacks that don’t make much sense, other times you might notice something that has a name close to an item you use in your environment. This could be the difference between having an ongoing issue and not doing anything about it or having a possible mitigation. Much like with public symbols, it only goes so far and there are further things that can be done to get more data with extended events – yes, I can hear you groan while reading the whole “extended events” part but it’s what we have and it works fairly well… for the most part.
Hi Sean, do you have any RSS feed for the blog? I’d love to follow this series, but there doesn’t seem to be a good way of doing so except checking the site periodically.
James
Great call out, I didn’t realize it was missing! I added a quick link to the front page but you can also subscribe with the following url https://seangallardy.com/feed/
Thanks for pointing this out, James!