This is inspired by all the times I am forced to repeat myself. To cut down on hearing my own horrible sounding voice, I present to you a FAQ style post of all things Availability Group related that I am consistently asked.
Last updated June 25th, 2020.
Q1: What’s the limit of databases I can put in an AG?
Sean: However much your hardware can support.
Q2: How much can my hardware support?
Sean: You’ll know when you run out of worker threads, hit IO limits, CPU limits, etc., but I wouldn’t go more than 5 mostly idle databases per core right now with the latest hardware.
Q3: Do my secondary replicas really need to be the same level of hardware as my primary replicas?
Sean: They should, this assumes you are eventually going to fail over to them or they might need to run your workload. It does *not* assume they will be used for readable workloads. Remember that the secondary replica is already accepting and writing log data, redoing that data, and then if you’re running a bunch of read workloads on top of that it might actually be *busier* than your primary. Yes, I’ve witnessed secondary replicas getting hammered harder than their associated primary.
Q4: Ok, so is there a limit on the number of AGs I can have?
Sean: No, but I wouldn’t go willy nilly with them because they are all cluster resources and will require listeners. This can have negative effects on Windows Server Clustering if there are.. what some might call… a plethora of resources. This is especially true if you query the cluster DMVs in SQL Server as those are all hitting the cluster APIs, adding even more load and randomness to your setup.
Q5: How do I know how many and what databases to put in an AG?
Sean: AGs should be viewed as a logical collection of databases. If you look at it this way, the databases that are needed to all fail over together to keep that application running is what should be in the AG. I know some people take the entire server approach, which might work well for you, but do remember that there are other options such as database health checking that was added in 2016. This means you can have a single database fail over your entire AG if it has an issue, which might not be what you want.
Q6: Can I use a gMSA with Always On?
Sean: Yes, whether it’s an FCI or an AG you can use a gMSA.
Q7: Can I <insert some crazy architecture drawing> do this?
Sean: You can do most things with AGs and FCIs… but I want you to take a step back and ask yourself, “Do you want to support that at 2 am on a holiday weekend in an imbibed state?” Probably not. So if I don’t want to, and you don’t want to… best to just not do it that way.
Q8: Can I have more than one instance of SQL Server on each node in an FCI/AG?
Sean: You can, but I wouldn’t. When you mix instances, it makes troubleshooting harder and you get into nit picking for resources. If you do multiple instances, you’ll want to set scheduler affinity, max server memory, I/O governance, etc., so that one instance doesn’t stomp on the other. In reality I’ve yet to see anyone do this correctly or… really… at all… and then spend however long trying to figure out why there is a performance issue on a node that has 7 instances of SQL Server, which of course are all active all the time. I don’t want to deal with that pile-o-crap, so you probably don’t either.
Q9: Should I use autoseeding?
Sean: This is a huge “it depends” answer but here are the guidelines I give. If it is going to take more than 15-20 minutes to seed your database (if it takes you longer than that to do a regular restore, for example) then don’t use autoseeding. I’m sure you’re all asking where the number came from and why, so let me explain. The number came from thin air, it’s completely made up but based on normal log backup times that I tend to see in the myriad of environments to which I am exposed. In many of these environments, having the log not be able to truncate for 30 minutes can mean not giving up tens of GB which could result in log growths, lower throughput, etc., which might not be your specific situation – so use your own best judgement – however the longest autoseeding I know of took about 14 hours which means it can work, it’s just not a good idea.
Q10: Can I control the network bandwidth usage of my AG?
Sean: No. In fact, that’d be a terrifically terrible idea. You’ll either end up with a log send queue (lost data) or blow out your transaction log. If you were *really* inclined to step on rusty nails then you can use your favorite hardware or software to QoS this traffic… but I don’t know too many people that purposefully like to step on rusty nails.
Q11: Should I do many AGs or a single AG with everything?
Sean: This is effectively Q4 + Q5 asked in a different way. Generically, at the end of the day, whether you do 5 AGs with 2 databases or a single AG with 10 databases, there isn’t too many more resources used doing one over the other (within reason, for example don’t create 1500 AGs each with a single database in them) so effectively at small numbers you can think of this as a wash. Either way is fine. If you create more AGs, that’s more listeners and more cluster resources but finer control over AG primary usage and database failover. Using a single AG removes admin overhead at the expense of finer controls. It’s really dealer’s choice at small numbers. Higher numbers you’d want to balance the usage with a preference for less individual AGs.
Q12: The database says “synchronized” but when I query the secondary I don’t see the updated data, why?
Sean: There are too many things to unpack for a simple answer here, so I’m going to do some hand waviness and say that I’ll talk about this in a different [not yet made] blog post. The short answer is… there are a few reasons. First, commit is only on writing to the log and not redoing the log. Second, the decoupling of these processes mean they will run asynchronously and there is no guarantee that it will force any sort of data SLA for a readable secondary. Third, the redo process can be blocked by running select queries because it still is required to be consistent for the queries executing. Fourth, the secondary readable replica is using snapshot isolation for the select queries under the covers so none of that new data redone will be visible to the running query.
Q13: When should I used a Distributed Availability Group?
Sean: This one has some nuance to it. Obviously the simpler the design, the better… having said that, there are some great uses for it such as migrating across clusters and long distance disaster recovery (note I didn’t say high availability). There are some extremely niche scenarios that I am not mentioning, but the gist of it is the more of these environments you daisy chain together, the better your monitoring, alerting, and runbook/playbook/automation infrastructure should be. Why? Each of these are different clusters, that can each have their own problems. If you have one issue on a secondary on the forwarder side and it goes unnoticed, it’ll propagate back to the global primary and your whole environment is now having an issue. It actually puts a larger onus on making sure all eleventy-billion servers you want running this way are in tip-top shape. Anything less than stellar will cause you issues and headaches.
Sean, really a useful post and no, rusty nails are no one’s favorite.
The only addition I have is about Q11. If you have multiple AGs and only some fail over, you are running into licensing issues potentially. I’ve heard of audits resulting in penalties due to that happening.
Great point! Yes, if you use multiple AG’s then it is incumbent on your admins to make sure you aren’t violating any licensing items :X
This is why I ended up with only one AG on the server. I’m not licensed for multiple readable instances.
Q7 – This. All day, all night, everyday, every night! Having inherited a couple of these “this will be really cool to do” systems. AND. Having to support them in an “imbibed” state, I can authoritatively state that every DBA and systems architect on the planet should read this carefully.
Exactly. It’s really bad when someone is attempting to explain the architecture to me, with the entire system down, and they can’t even articulate it well enough to give me an understanding of what’s there. It makes it very difficult to give help 🙁
I have FCI\AOAG. Is there a way around failover to perform a restore on the secondary for a new database? Want clarification here as well. The disk is shared which means only one node has the dependency so I can create the database on Node1 (storage dependency) yet can’t restore on Node2 because it doesn’t have a storage dependency and has no clustered disk, therefore I’d have to move the storage to Node2 via failover to restore, correct?
Lynn,
If you have an AG on top of an FCI then the secondary node can’t be on the same node that could host an FCI. If you fail over the FCI the database will already exist and the AG should run as normal as the AG starts up after the FCI. Maybe there is some detail that was left out?
My shop has the luxury of being able to get all our databases on a single server. This is clustered, and we have a single AG that contains (almost) all databases. We then use a dAG to propagate the databases to our BCO location, which also has a single SQL cluster. Do you have any hints about dAGs you want to share.
I was tempted to add some items around distributed AGs and I believe you’ve given me the impetus to do so. My *personal* opinion on the matter is avoid them if you can, if you can’t… make sure you have a very, very, robust monitoring and health checking framework in place. There is quite the amount of inferred baggage that distributed AGs bring, so while they are very good for some things, I do find them over used for most other things.