The road to Hell is paved with good intentions.
I get it, your security team is the best ever. You work in an industry (or the public sector) that requires servers be “compliant” by constantly and incessantly scanning them for vulnerabilities, in production of course. I know for some this checks a required box, and that’s a discussion about regulatory oversight and requirements rather than the technical one we’re focused on here. Then, at some point, you check your SQL Server and notice some errors in the errorlog: something about a weird Unicode login being attempted, SA without a password, connection errors, etc., one of which might be on your TCP mirroring endpoint (Mirroring, Always On, etc.). This is almost always (I have yet to witness a scenario where this isn’t the underlying issue) caused by a security vulnerability scanner, the most common one being Nessus.
Example Error 9642 – the generic message, with placeholders where the actual error and state go:
An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: %i, State: %i. (Near endpoint role: %S_MSG, far endpoint address: ‘%.*hs’)
Example Error 8474 – note that this is the error number reported inside the 9642 message:
An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: 8474, State: 11. (Near endpoint role: %S_MSG, far endpoint address: ‘%.*hs’)
Example Error 17836:
Length specified in network packet payload did not match number of bytes read; the connection has been closed. Please contact the vendor of the client library.
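Rather than scrolling the errorlog in SSMS, you can search it directly for these messages. A minimal sketch using sp_readerrorlog (undocumented but widely used; the parameters are the log number where 0 is the current log, the log type where 1 is the SQL Server errorlog, and a search string):

-- Search the current SQL Server errorlog for the endpoint and TDS errors discussed above
EXEC sp_readerrorlog 0, 1, N'Service Broker/Database Mirroring transport';
EXEC sp_readerrorlog 0, 1, N'17836';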
What’s really happening.
The long and the short of it is that 8474 state 11 means the Service Broker message (that’s another blog post, about how people keep telling me SB functionality isn’t used for mirroring or Always On) is corrupt, and 17836 state 20 means there was an issue with the TDS packet. There are various checks made as part of defensive programming to keep SQL Server safe when bad actors such as vulnerability scanners attempt to do bad things. In most cases, the data and headers of the TCP packets have been altered, which is the security software attempting buffer overruns and underruns, among other objectionable things. This can be confirmed a few different ways, but the easiest is a network trace and inspecting the packets coming across.
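If you want to see this for yourself, the first thing a network trace needs is the port your mirroring/Always On endpoint is actually listening on, so the capture can be filtered to it (5022 is only the default). A minimal sketch against the endpoint catalog views:

-- Find the TCP port the Database Mirroring / Always On endpoint listens on,
-- so a network capture can be filtered to just that traffic.
SELECT te.name,
       te.type_desc,   -- DATABASE_MIRRORING covers both mirroring and availability groups
       te.port,
       te.state_desc
FROM sys.tcp_endpoints AS te
WHERE te.type_desc = N'DATABASE_MIRRORING';

Capture on that port during a scan window and the malformed headers and payloads are plain to see.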
Is this cause for concern? Not really… but would you subject something you count on every day to a constant barrage of bad actors? Probably not. While it shouldn’t cause any actual issue, the stars sometimes align, and the shit hits the fan. If the security team is cool with being the ones who own the giant dollar cost when the server does have an issue, then wow, kudos, carry on doing bad things in production. If they aren’t, then maybe it’s time to rethink hitting your airplane with birds before it takes off to show how resilient it is to bird strikes. It’s stupid.
What now?
You have two major options. Option 1 is to have your security team whitelist that server/port, or take a combo approach that still allows checkboxes to be checked but doesn’t cause a bunch of clutter in the errorlog. Option 2 is to just stop incessantly scanning the server, which I’ve seen happen at 5-minute intervals, all day, in some financial industry scenarios.
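For the whitelisting route, the security team will want to know exactly where the scans are coming from. The 17836 entries in the errorlog typically include a [CLIENT: x.x.x.x] portion, and the connectivity ring buffer keeps a rolling, in-memory record of failed connection attempts as XML you can inspect for the remote address. A rough sketch (the ring buffer is undocumented and its XML layout can change between versions, so treat it as a starting point):

-- Recent connectivity-related records (failed or rejected connection attempts, etc.);
-- the XML payload includes details such as the remote address that triggered the record.
SELECT CAST(rb.record AS xml) AS connectivity_record,
       rb.[timestamp]
FROM sys.dm_os_ring_buffers AS rb
WHERE rb.ring_buffer_type = N'RING_BUFFER_CONNECTIVITY'
ORDER BY rb.[timestamp] DESC;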
Considering the ROOT CAUSE is an application *maliciously* attempting to cause issues by doing bad things, it’s a wonder people run it against production in the first place. Since this isn’t a SQL Server issue, you’ll want to contact someone who cares enough about your production server to stop actively trying to break it, or at the very least whoever is in charge of monitoring and alerting (if that even exists) so that you aren’t alerted (my personal favorite is to copy the entire security team + manager + VP on each one of these alerts) when this happens every Tuesday at midnight.
Footnote: This doesn’t mean you should completely stop checking your production environment for vulnerabilities. It does ask the reader to think about how often it should be done, given good change management processes and skills are in place, and what to expect when the testing is run against SQL Server.
We had this very issue. The scanning was driving our alerting mechanisms crazy. Our fix was to firewall-block the scanners specifically for the SQL ports in question. No more false alerts.
I also ponder the question of what is NOT known. Whilst hitting those ports may be all well and good, has something been triggered under the covers that may only come to the surface days later when you have a misbehaving SQL Server? We can’t see the underlying SQL Server code; who knows what hitting that port has done under the covers. It may look like it has been rejected, but has it?
Hey John,
Exactly, thanks for taking the time to share! It makes me wonder what the point of scanning the server is if it’s just going to be firewalled, which is what should be done in the first place. If we have “block everything except these servers/subnets for all ports except these” types of rules, it stops all of this in its tracks. It’s a non-starter. Sadly, I see it the other way all too often, where it’s wide open and nothing is blocked. I also see servers with three competing anti-virus or host intrusion products installed, which causes a whole other set of issues.
In regard to whether the connection has been accepted or rejected, I guess it depends on how pedantic you’d like to get about the “accepted” part. The long and the short of it is that the headers, payload, etc., are checked to make sure they are indeed correct. In these cases with a scanning tool, many of the header items are incorrect, the wrong flags are set, the wrong protocol bits are set, etc., and thus the connection is rejected. Nothing should be happening days later; if it is, it’s almost certainly not from this.