Summary: you should skim this free online book for inspiring ideas about how administration works at scale, although don’t expect to put the practices into place without management buy-in.
Let’s get one thing out of the way first:
you, dear reader, probably don’t work at Google scale.
Google faces the same kinds of problems your employer does, just at a different scale. Instead of keeping 3-server applications reliable, they need to keep 3,000-server applications reliable. As a result, they have very different budgets than you do, which gives them the luxury of treating reliability as a serious discipline.
In their free Site Reliability Engineering book, they share some of the lessons they learned about:
- Designing service level objectives
- Tactics like deployments, monitoring, automation, and release engineering
- How to load balance and handle overload, and much more
You really don’t need to read the whole book – just skim it, and you’ll take away interesting concepts and stories. While DevOps and SRE aren’t the same thing, you’ll start to see how your DBA duties, DevOps duties, and developer duties all blend together to work towards the same business goals.
My favorite concept: error budgets
Say you’re given a 99.5% uptime goal. Instead of thinking of it in terms of time, think of it as, “0.5% of my service’s requests may result in errors.” Maybe the entire service is unusable, maybe a request returns a failure of some kind, maybe it times out.
Instead of aiming for 0% errors, aim for 0.5% or fewer errors.
0.5% is your error budget, and you’re expected to spend it.
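To make the arithmetic concrete, here is a minimal Python sketch that turns an uptime goal into an error budget. The function names, the one-million-request figure, and the 30-day window are my own assumptions for illustration; they don’t come from the book.

```python
from decimal import Decimal


def error_budget_fraction(uptime_goal_percent: str) -> Decimal:
    """Share of requests (or time) allowed to fail, e.g. '99.5' -> 0.005."""
    # Decimal avoids float rounding noise such as 1 - 0.995 != 0.005.
    return (Decimal(100) - Decimal(uptime_goal_percent)) / Decimal(100)


def allowed_failed_requests(uptime_goal_percent: str, total_requests: int) -> int:
    """How many requests may fail before the goal is missed."""
    return int(error_budget_fraction(uptime_goal_percent) * total_requests)


def allowed_downtime_minutes(uptime_goal_percent: str, days: int = 30) -> Decimal:
    """The same budget expressed as minutes of full outage over a window."""
    return error_budget_fraction(uptime_goal_percent) * days * 24 * 60


# Hypothetical workload: one million requests in a 30-day month.
print(allowed_failed_requests("99.5", 1_000_000))  # 5000
print(allowed_downtime_minutes("99.5"))            # 216 minutes, i.e. 3.6 hours
```

Under these assumptions, a 99.5% goal leaves room for 5,000 failed requests out of a million, or about 3.6 hours of total outage in a 30-day month.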
You may spend it on purpose, on planned outages for patching or major app code deployments. You may spend it accidentally, in the form of unplanned outages. Or you may spend it on plain old experimenting, like trying to cut costs. This starts to set the stage for why we need DevOps: developers want to spend part of the error budget on deployments.
DBAs usually aim for zero errors. DBAs don’t want to spend the error budget at all, but the business needs us to. If you don’t use any of your error budget, that’s a problem because it indicates that you’re probably spending too much (money or resources), not doing the right upgrades/patching to keep your application current, or trying to make developers work too hard to build absolutely perfect deployments every time (which cost a ton of money to build). You should probably look at ways you could cut costs in order to get closer to the business’s objectives.
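One rough way to see whether you’re actually spending the budget is to tally failures by cause and compare the total against what you’re allowed. This is only a sketch: the cause names and the failure counts below are made up for illustration, and the 5,000-error allowance assumes a 99.5% goal over one million requests.

```python
from fractions import Fraction


def budget_spent(allowed_errors: int, failures_by_cause: dict[str, int]) -> Fraction:
    """Return the share of the error budget consumed (1 means fully spent)."""
    if allowed_errors <= 0:
        raise ValueError("allowed_errors must be positive")
    return Fraction(sum(failures_by_cause.values()), allowed_errors)


# Hypothetical numbers for one month.
failures = {
    "planned_deployments": 1200,
    "patching": 800,
    "unplanned_outages": 500,
}

spent = budget_spent(5000, failures)
print(f"{float(spent):.0%} of the error budget spent")  # 50% of the error budget spent
```

A number that sits near zero month after month is the signal described above: the business is paying for more reliability than it asked for.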
Prefer videos? They’ve got videos too.
Check out the reliability engineering talk from Google Cloud Next 2017:
I’m certainly not saying that Google does everything right, or that you should model all of your practices after theirs. That would be ludicrous: they’re huge, and they have huge budgets. But there are some interesting lessons here for your own database operations and deployments.