I don’t often find many people using FileStream in their databases (which isn’t a bad or good thing, in my opinion, just a statement of fact). Some technologies in SQL Server use it behind the scenes, such as FileTable or Hekaton, and there isn’t really any getting around it in those cases. However, a friend from Database Administrators Stack Exchange, Hannah Vernon (w), brought me an interesting issue: a database that had been in a Distributed Availability Group on SQL Server 2019 with no issues but, after upgrading to SQL Server 2022, started having a major problem.
The Problem
After upgrading to SQL Server 2022, the distributed availability group would stay healthy for some amount of time… sometimes a few minutes, sometimes a day, but would eventually become unhealthy. When it was investigated, the same error would happen each time:
Error: 3633, Severity: 16, State: 1. The operating system returned the error '2(The system cannot find the file specified.)' while attempting 'CreateFileW' on 'D:\somepath\anotherpath\deeperpath\filestream_file_guid' at 'fsdohdlr.cpp'(2474). Always On Availability Groups data movement for database 'Database_Name' has been suspended for the following reason: "system" (Source ID 3; Source string: 'SUSPEND_FROM_CAPTURE'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
I can’t say I see many suspend from capture issues; in fact, it’s one of the _few_ I’ve actually witnessed in the wild. There are four interesting parts here. The first is that the error returned is an OS error, 2, which is “the system cannot find the file specified”, and I don’t find that OS errors this straightforward are generally misleading or incorrect. The second is that this is coming from CreateFile, whose name is a bit of a misnomer since it’s used to create new files, open existing files, and can even delete files (it really should be called DoFileAction or something, I don’t know, I’m not that great at naming things). The third is that this is specifically, and always, on a FileStream file.
The fourth part is that this results in a SUSPEND_FROM_CAPTURE state for the database on the forwarder. The forwarder, in a distributed availability group, is the primary replica of the secondary availability group. Its job is to _forward_ the data coming in from the global primary (the primary replica of the primary AG) to all of the local secondary replicas, in addition to handling other items such as seeding databases. The odd part about this issue is that it’s having a problem capturing the log, but in order to capture the log it has to know there is new log… which it would get from the primary, so the error didn’t quite make sense unless the file actually wasn’t there… but the file _did_ exist.
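If you want to check for this state yourself on the forwarder, here’s a minimal sketch (not one of the scripts from the actual investigation) that looks at the HADR DMVs and, optionally, searches the error log for the 3633 error:

```sql
-- Minimal sketch: show which local databases are suspended and why.
-- Run this on the forwarder; in this scenario suspend_reason_desc
-- shows SUSPEND_FROM_CAPTURE.
SELECT
    DB_NAME(drs.database_id)        AS database_name,
    drs.is_suspended,
    drs.suspend_reason_desc,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;

-- Optional: search the current error log for the 3633 error.
-- xp_readerrorlog is undocumented but commonly used for this.
EXEC master.dbo.xp_readerrorlog 0, 1, N'3633', NULL;
```

Once data movement is suspended, you’d resume it with `ALTER DATABASE [Database_Name] SET HADR RESUME;` as the error message suggests, though without addressing the underlying cause it will just suspend again.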
Investigating and Other Boring Stuff
I worked with Hannah at first to check whether the files existed, along with some other fundamental items that really set the stage for further investigation. We collected procmon captures and some basic extended events data, but nothing was super helpful. I was given instructions on how to reproduce the problem, which only occurred on a database with FileStream, and then it was off to investigate.
I won’t bore you with too much of the nitty gritty, but there are some extended events to track log blocks, fragments, FileStream, and pretty much a cornucopia of other super verbose and hard-to-use items. These aren’t things you’d want to turn on, in general, unless you’re troubleshooting something very specific, such as in this case. Just turning these events on and completing a repro of the issue in ~30 seconds led to XE data in the hundreds of MB. That’s for a single database, with a single-threaded light workload, in a controlled environment… imagine if this were a production environment with 100 other databases and a massive workload. I’ve done that before; going through 600 GB of extended events from 10 minutes of capture isn’t as fun, flashy, and exciting as they make it look on TV (and it takes days to go through, so try not to have a life outside of work).
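I’m not going to publish the full list of verbose debug-channel events we turned on (they’re not something you’d want running day to day), but as a rough sketch, even a lightweight session that just catches error 3633 as it’s reported can help you pin down exactly when the suspension happens. Everything below is an illustration (the session name is made up), not the tracing used in the investigation:

```sql
-- Lightweight illustration only: capture error 3633 when it is raised,
-- along with the database and statement context, to an event file.
-- This is NOT the verbose log block / FileStream tracing described above.
CREATE EVENT SESSION [track_3633] ON SERVER
ADD EVENT sqlserver.error_reported
(
    ACTION (sqlserver.database_name, sqlserver.sql_text)
    WHERE ([error_number] = 3633)
)
ADD TARGET package0.event_file
(
    SET filename = N'track_3633.xel', max_file_size = (64)
)
WITH (MAX_MEMORY = 4096KB, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS);
GO

ALTER EVENT SESSION [track_3633] ON SERVER STATE = START;
```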
The big takeaway here, though, is that it took XE, Procmon, and some light debugger use to verify what was actually occurring.
The Answer Is: If 1 Is Good, 2 Is Better
Looking at the XE and Procmon data, there is one fundamental issue that does seem to need fixing. In SQL Server Availability Groups, the way FileStream works (10,000-foot view, here) is that the file needs to be created and some special log records made for it. The log records and the file need to be sent to the secondary replicas in a very specific order; any deviation from that order can cause all kinds of issues. In this case, there are two items at play which, together, break this fundamental contract.
The first problem is that this is a timing issue. There are certain setups that could be achieved where, for various reasons, the bad timing never hits. This means that even though the contract of sending the records in a specific order is violated, when everything is coalesced on the other side (the forwarder), the transport mechanism (TCP/IP) happens to deliver it in the correct sequence. The setup or configuration could be completely broken, but due to items beyond your control, it never actually breaks!
The second problem, and the main issue, is that SQL Server 2022 added an enhancement that allows more than one TCP/IP connection for Distributed Availability Groups. Please understand that this is a great feature and should be utilized, as it offers a performance increase in other scenarios, just not ones where FileStream is used. Multiple connections allow multiple paths for the log and FileStream information to take, which is where the out-of-order issue can occur if there are any delays in TCP transmission, whether from a lack of CPU/memory or on the wire. This effectively allows double the bandwidth to be used with Distributed AGs, which can be helpful, especially in high-latency setups; however, as of CU7, it will break any databases that are using FileStream (again, depending on timing).
What to do? Normally when new features come out, there is a chance that not every single edge case, test case, and scenario is thought about or tested… I mean, that’s kind of how life works; there’s never been a drug, toy, or vehicle recall for issues not found during development, right? This is why there is generally a trace flag or feature switch put in to revert the new functionality, which in this case happens to be trace flag 5597. If you find yourself in this situation, apply this trace flag as a startup trace flag to any replica that could be a global primary or forwarder. After adding it to the startup trace flag list, you’ll need to restart the SQL Server instance. This reverts to the previous behavior of a single TCP/IP stream between the global primary and the forwarder, which removes the chance for the timing issue to occur.
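As a quick sketch of that (the exact steps depend on how you manage your instances): add -T5597 through SQL Server Configuration Manager (SQL Server service → Startup Parameters), restart the instance, then confirm the flag is active:

```sql
-- After adding -T5597 as a startup parameter and restarting the instance,
-- confirm the trace flag is enabled globally on each replica that could
-- be a global primary or forwarder.
DBCC TRACESTATUS (5597, -1);

-- Or list everything that's currently enabled:
DBCC TRACESTATUS (-1);
```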
Closing Thoughts
Having multiple connections in this case is overall a good thing, and I don’t see a reason not to use it, nor do I see any major reasons not to upgrade to SQL Server 2022. The problem was very interesting and did take a large amount of time to investigate and troubleshoot, which is what I feel most people seem to forget about. In a world where things are expected to be completed within minutes or hours, taking days or weeks of heads-down, hard investigation seems like a thing of the past. Troubleshooting and deep investigations are becoming relics, and quality skills are being lost. Take the time to learn and grow; it’ll only help you in your future.