Irrational Exuberance for 11/06/2019
Hi folks,
This is the weekly digest for my blog, Irrational Exuberance. Reach out with thoughts on Twitter at @lethain, or reply to this email.
Posts from this week:
- Forecasting synthetic metrics.
- Sending weekly 5-15 updates.
- Investing in technical infrastructure @ SRECon EMEA 2019
Forecasting synthetic metrics.
Imagine you woke up one day and found yourself responsible for a Site Reliability Engineering team. By 10AM, you’ve downloaded a free copy of the SRE book, and are starting to get the hang of things. Then an incident strikes: oh no! Folks rally to mitigate user impact, shortly followed by diagnosing and remediating the underlying cause. The team's response was amazing, but your users depend on you and you feel like today you let them down. Your shoulders are a bit heavier than just a few hours ago. You sit down with your team and declare your bold leader-y goal: next quarter we’ll have zero incidents.
Your team doesn’t know you well enough yet to give direct feedback when you have a bad idea, so instead they come back to you the next day with a list of projects to accomplish your goal. Their proposals range from the reliable (delete all your software and go home) to riskier options (adding passive healthchecks). You open your mouth to pass judgement on their ideas, pause with your jaw hanging ajar, and then close it. A small problem has emerged: you have no idea how to pick between projects.
It's straightforward to measure your historical reliability using user impact, revenue impact or performance against SLOs. However, historical measures are not very helpful when it comes to determining future work to prioritize. Which work will be most impactful, and how much of that work needs to get done to make us predictably reliable rather than reliable through good luck?
One of my favorite tools for measuring complex areas is synthetic metrics. Synthetic metrics compose a variety of input metrics into a simplified view. For example, you might create a service quality score calculated from whether the service has been deployed recently, has zero undeployed CVE patches, has proper healthchecks, has tests which complete in less than ten seconds, and so on. Instead of having to talk about each of those aspects individually, you’re able to talk more generally about the service’s state. Even more usefully, you can start to describe the distribution of service healthiness across all the services at your company.
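As a hypothetical sketch (the inputs, weights, and thresholds here are all invented rather than drawn from any real system), composing such a service quality score might look something like this:

# Hypothetical sketch: compose per-service inputs into a single quality score.
# All inputs, weights, and thresholds are invented for illustration.
def service_quality_score(days_since_deploy, undeployed_cve_patches,
                          has_healthchecks, test_runtime_seconds):
    score = 0.0
    score += 0.3 if days_since_deploy <= 7 else 0.0        # deployed recently
    score += 0.3 if undeployed_cve_patches == 0 else 0.0   # no pending CVE patches
    score += 0.2 if has_healthchecks else 0.0              # proper healthchecks
    score += 0.2 if test_runtime_seconds < 10 else 0.0     # fast test suite
    return score

# Describe the distribution of service healthiness across the company.
services = {
    "payments-api": service_quality_score(2, 0, True, 6),
    "batch-worker": service_quality_score(40, 3, False, 25),
}
print(sorted(services.items(), key=lambda kv: kv[1]))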
When I first started experimenting with synthetic metrics, I read an excellent piece by Ryan McGeehan on risk forecasting for security. The techniques McGeehan discusses don’t quite solve the problem of prioritizing future work, but as I ruminated on them I fell into an interesting idea: can we use the forecast of synthetic metrics to determine and prioritize work?
Brian Delahunty recently moved into Stripe’s Head of Reliability role, and as part of that transition I’ve gotten the chance to partner with him, Davin Bogan, Drew Fradette, Grace Flintermann and a bunch of other great folks to rethink how we should select the most impactful reliability projects. What’s written here reflects a great deal of collective thought from those discussions, as well as inspiration from Niels Provos' approach to security planning.
I’ve also previously written about using systems modeling to inform a reliability strategy, which has some similarities to the approach described here.
Forecasting reliability
Let’s dig into a specific example of how we can forecast synthetic metrics. I'll focus on reliability, but I believe this technique is generally applicable to any area which typically grades execution against lagging indicators instead of leading indicators. It's particularly valuable for areas with high-variance lagging indicators like security breaches, major incidents, and so on.
Before we can forecast a synthetic metric, we have to design the synthetic metric itself. To design a synthetic metric for reliability, a useful question to ask ourselves is what would need to be true for us to believe our systems were predictably reliable?
A quick thanks to David Judd who has spent a great deal of time digging into this particular question, and whose thinking has deeply influenced my own.
Some factors you might consider when calculating your reliability are:
- How safely do you make changes? This certainly includes deploying code changes, but also feature flags, infrastructure changes and what not.
- How many fault levels do you have which are backed by only a single fault domain?
- How long has it been since you verified the redundancies within each fault level?
- How much headroom do you have for traffic spikes?
- How sustainable are your on-call rotations in terms of having enough ramped-up folks and appropriate page rate?
There are an infinite number of factors you could include here, and what you pick is going to depend on your specific architecture. What’s most important is that it should be staggeringly obvious what sort of project you'd undertake to improve each of these inputs. Too many unsafe changes? Build safer change management tooling. Haven’t exercised fault level redundancy frequently? Run a game day. Stay away from output metrics like having fewer incidents, which immediately require you to answer broad questions of approach.
To keep our example simple, let’s imagine we focus the first version of our reliability metric on: making safe changes, eliminating single-domain fault levels, and verifying redundancy within fault levels.
percent_safe_changes =
    (percent_safe_feature_flag_changes * 0.3) +
    (percent_safe_code_changes * 0.3) +
    (percent_safe_infra_changes * 0.4)

percent_redundant_fault_levels = (some calculation)
percent_recently_exercised_domains = (some calculation)

reliability_forecast =
    (percent_safe_changes * 0.4) +
    (percent_redundant_fault_levels * 0.4) +
    (percent_recently_exercised_domains * 0.2)
With some simple arithmetic and experimentation, you’ll compute a score that reflects your reliability risk as it stands today. Then by analyzing the trends within those numbers, you can forecast what that score will become a year from now. If you’ve invested heavily in quality ratchets, then you may find that you’ll become considerably more reliable over the next year without beginning any new initiatives.
You may, on the other hand, find that you’re spiraling towards doom. Either way, this is the starting point for your planning.
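As a minimal sketch of that extrapolation, here's how a simple month-over-month trend might be projected forward a year; the historical scores and the straight-line assumption are invented for illustration:

# Minimal sketch: extrapolate the reliability score a year out from monthly history.
# The scores below and the linear trend are assumptions for illustration only.
monthly_scores = [0.58, 0.60, 0.61, 0.63, 0.64, 0.66]  # last six months

deltas = [b - a for a, b in zip(monthly_scores, monthly_scores[1:])]
avg_delta = sum(deltas) / len(deltas)
forecast_next_year = min(1.0, max(0.0, monthly_scores[-1] + 12 * avg_delta))
print(f"Reliability forecast one year out: {forecast_next_year:.2f}")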
This measure then allows you to calculate the impact of each considered project, and to prioritize them based on the return on investment for future reliability. This also helps you structure effective requests to other teams. With this calculation, it’s now extremely clear what you would want to ask other teams to focus on, as well as why the work matters.
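For example, a rough sketch of that prioritization might score each candidate project by its expected improvement to the forecast per engineer-month; the projects, score deltas, and costs below are all invented for illustration:

# Hypothetical sketch: rank candidate projects by forecast improvement per engineer-month.
# All projects, score deltas, and costs are invented for illustration.
projects = [
    {"name": "safer deploy tooling", "score_delta": 0.06, "eng_months": 4},
    {"name": "fault-domain game day", "score_delta": 0.03, "eng_months": 1},
    {"name": "add passive healthchecks", "score_delta": 0.02, "eng_months": 2},
]

for p in sorted(projects, key=lambda p: p["score_delta"] / p["eng_months"], reverse=True):
    print(f'{p["name"]}: +{p["score_delta"]:.2f} score, '
          f'{p["score_delta"] / p["eng_months"]:.3f} per engineer-month')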
A quarter later, you can measure your updated reliability score and compare it against your forecast, determining if your projects had the expected impact. Now you’re able to review and evolve your reliability strategy and execution against something predictable, decoupling your approach from underreacting to narrow escapes or overreacting to unfortunate falls.
Translate score into impact
Now that you’re forecasting reliability, you might say that you are trending towards 52% reliability next year, and 40% the year after that. Some folks will be motivated by the scores feeling low, but these are pretty abstract numbers. How much should the company prefer a reliability forecast of 80% over a forecast of 79%?
It’s powerful to translate the forecasted score into a forecasted result. For reliability, we can estimate the impact of a serious incident and treat one minus the score as the probability of such an incident occurring in the next year. A simple version might take the average daily revenue for the next year and then multiply it by one minus your reliability forecast.
impact = (1.0 - reliability_forecast) * avg_daily_revenue_next_year
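Plugging in some assumed numbers (a 0.80 reliability forecast and $1,000,000 of average daily revenue, both invented for illustration), the arithmetic works out like this:

# Worked example with invented numbers: a 0.80 forecast against $1M average
# daily revenue implies roughly $200,000 of forecasted impact.
reliability_forecast = 0.80
avg_daily_revenue_next_year = 1_000_000

impact = (1.0 - reliability_forecast) * avg_daily_revenue_next_year
print(f"Forecasted impact: ${impact:,.0f}")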
Is this a particularly robust calculation? No, it’s really not. But are you having a conversation about the dollar impact of your reliability planning? Why, yes. Yes, you are.
Iterate to alignment
Folks will initially disagree with your forecasts and impact numbers. That’s the entire point! Each time someone disagrees with your forecast or projected impact, that is exactly the opportunity you're looking for to refine your methodology. Metrics become valuable through repeated exposure to a medium-sized group with consistent membership over the course of months. It’s only this repeated application that can refine a complex metric into one that reflects both your worldview and the worldview of your stakeholders.
There is a lot to learn from Perri’s approach to product management in Escaping the Build Trap. It’s the results that matter, not shipping projects, and the first version of a metric never has the right results.
Becoming valuable
Sometimes teams working in areas like security or reliability find themselves smothered by the sensation that they are either (a) performing badly or (b) performing well enough to ignore. No place on the continuum between bad and ignored feels particularly good.
The twin techniques of forecasting synthetic metrics and translating those metrics into impact will extend your continuum to include acknowledged value to your company and your users. Even when things happen to be going well, you can showcase the underlying risk that requires continued investment. When things are going poorly, you can show you’re doing the right work, even if it’s not showing its impact yet.
I’m quite curious to hear from more folks trying similar approaches!
As an aside, I also want to mention how useful this approach can be for evaluating the quality of incident remediations. Many incident programs emphasize that folks must have remediations for each incident, and perhaps emphasize a strict deadline on when those remediations must be completed, but it's often unclear if the remediations are of high quality. A good synthetic metric for reliability makes answering this question easy: a remediation's quality is the extent that it shifts the reliability score in a positive direction.
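As a small illustrative sketch (the weights mirror the earlier example formula, and all the numbers are invented), that shift can be computed directly:

# Hypothetical sketch: grade a remediation by how much it moves the reliability score.
# Weights mirror the example reliability_forecast above; all inputs are invented.
def reliability_score(safe_changes, redundant_fault_levels, exercised_domains):
    return (safe_changes * 0.4) + (redundant_fault_levels * 0.4) + (exercised_domains * 0.2)

before = reliability_score(0.70, 0.50, 0.40)
after = reliability_score(0.70, 0.75, 0.40)  # the remediation added a redundant fault domain
print(f"Remediation quality: +{after - before:.2f}")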
Sending weekly 5-15 updates.
The trendy thing to do on the internet is to start publishing a newsletter. Trendy enough that even I started sending out my blog posts each week in email format.
I've consistently noticed that emails generate far more discussion than other distribution methods, which really shouldn't have surprised me: I've been sending company-internal updates for some time and they've frequently created important, spontaneous conversations.
About a year ago I started my most recent approach to sending weekly updates to relevant public (within the company) mailing lists. This practice is sometimes called a 5-15 report, reflecting the goal of spending fifteen minutes a week writing a report that can be read in five minutes. Personally, I create a new Google Doc each week and record anything I complete there, spending ten minutes polishing the list into something readable each Friday.
These emails have a few goals.
First, it's easy for folks to become detached from their leadership's priorities, and having a weekly update, sometimes a pointed weekly update, is a good way to close that gap. For this purpose, it helps to be as honest and direct about focuses and concerns as you can be without rocking the boat too much. (You probably should be rocking the boat a small amount.)
Second, one of the important contributions of leadership is creating ambient connective tissue across teams and projects. By sharing what I've learned about a new project, I find that often there are other folks who benefit from knowing, and that they wouldn't have learned about the project otherwise. Is reading a huge number of status emails the right way to learn everything? No, absolutely not, but it's a good supplemental method.
As an interesting note, these emails do not need to be widely read to be useful. I often find myself ignoring them initially but then going back to find the latest update from someone to answer a specific question later. Further, a small amount of sporadic reading goes a long way: I've found there is herd immunity for missing information. If just one or two folks in a given group know something important, it'll end up where it needs to go.
Finally, each half around performance review time, I use these emails to compile my brag documents for the preceding six months. Inevitably I've forgotten most things I've worked on, and these emails remind me of what I've done in a concise format.
As is often the case, we drove adoption by modeling the behavior rather than ever explicitly asking folks to send these updates. Most of the folks I work with directly have taken up the practice of sending out similar updates, which has reduced status updates in one-on-ones and helped refresh both their memory and mine when writing their performance reviews.
I do recommend this practice, and if you're considering it, I'd propose two quick rules to ease the initial rollout: (1) create a new mailing list for folks to send these updates to, rather than cluttering up existing lists, and (2) make them optional to read, as their volume can grow quickly.
Let me know how it goes!
Investing in technical infrastructure @ SRECon EMEA 2019
Earlier this year I wrote How to invest in technical infrastructure, which I got to present at SRECon EMEA 2019 at the beginning of October, and now the video is up.
(Video: https://www.youtube.com/watch?v=rYzmXLhIaHQ)

I've really enjoyed giving this talk, and really enjoyed the talks, atmosphere and attendees at SRECon. There was a real spirit of practitioners coming together to learn that I appreciated.
I also got the chance to give this talk at Velocity earlier this year, although I'd just strained my calf and was walking with crutches, so I was glad to get a redo once I was able to practice a bit more and walk without assistance. I'll be both sad and grateful to retire it soon and start thinking about what talk I want to write for 2020. So far I'm leaning towards a talk around the ideas in A forty year career, but I'm far from decided. (2018 was my first year doing much speaking, and my talk was on migrations.)
That's all for now! Hope to hear your thoughts on Twitter at @lethain!