
Irrational Exuberance

November 20, 2019

Irrational Exuberance for 11/20/2019

Hi folks,

This is the weekly digest for my blog, Irrational Exuberance. Reach out with thoughts on Twitter at @lethain, or reply to this email.


Posts from this week:

- You only learn when you reflect.
- Notes on Building Evolutionary Architectures.
- Expanding on S[a-z]{3,} Reliability Engineer roles.


You only learn when you reflect.

Early in your career, the majority of problems you work on are difficult because they are new to you. It's challenging to do good work on problems you've never encountered before. The good news, however, is that there are other folks on your team who've done it before and are already experienced with the ins and outs.

Even for garden-variety challenges, it's easy to spend a disproportionate amount of energy trying to work through the problem before asking for help. Working hard is a platitudinal virtue, but the true virtue is learning well. The important outcome is becoming adept at the task; suffering along the way is optional.

I once interviewed at a company that gave new hires an explicit tool to navigate persevere/learn tradeoffs, the “twenty-forty rule.” Always spend at least twenty minutes trying to solve a problem before asking for help, and never spend more than forty minutes before asking for help. I doubt these numbers are perfectly tuned for your team, but together they’re an effective mechanism to give explicit permission to ask your teammates for help, while also setting the expectation that you’ll spend time helping others.

The approach of working harder to overcome problems mostly works, though, as long as you or someone else is managing the flow of your incoming work. Earlier in your career, your manager or a senior peer will help you manage the number of features or tickets you take on each sprint.

As you get more senior, you'll increasingly be exposed to the unfiltered demands of "the business." In a slow-growing company, the increase is typically gradual enough that working harder, relieved by occasional hiring, will keep up with the additional work.

In faster growing companies or teams, though, working harder quickly becomes a self-defeating strategy. Not only will you become too busy to teach your teammates, you’ll become too busy to learn. This leads you into a downward spiral: you fall further behind the harder you work, eventually burning out.

When you’re overwhelmed by a complex problem, the sort where no one has enough context or perspective to solve it for you, the only solution is to create slow spaces to think. You learn through reflection, especially when it comes to the most complicated problems, and you have to fight your instinct to outwork these challenges.

So that’s the key advice here: if you’re in a rapidly changing, complicated situation – slow down! Stop working harder. There’s no door in that direction. You’ll never outwork deep problems, and that path is gilded with false progress, appearing to work but never getting you where you’re trying to go. (Similar to my argument against follow the sun on-call rotations.)

As a final thought for folks who are managing someone who is trying to outwork a complex problem: help them pause, regroup, create a slow space for thinking, and grow to overcome! There’ll be new challenges soon, so save energy for the way back.


Notes on Building Evolutionary Architectures.

I recently picked up Building Evolutionary Architectures by Ford, Parsons and Kua. It occupies a similar spot in my head as Accelerate and A Philosophy of Software Design, in the category of seasoned software practitioners sharing their generalized approach to software. Altogether, I rather enjoyed it, and it more elegantly explains many of the points I attempted to make in Reclaim unreasonable software.

Below are my notes from Building Evolutionary Architectures.

Summary

The book starts by asking a question that I’ve grappled with frequently: “How is long-term planning possible when everything changes all the time?” Their proposal is evolutionary architecture, which is architecture that “supports guided, incremental change across multiple dimensions.”

Incremental change is both “how teams build software” and “how they deploy it.” Building software this way requires delivering incremental units of value, moving further away from the tragedy of big bang software development. Deploying software incrementally is using the sort of modern development practices described in Accelerate.

Guided change is identifying “fitness functions” to measure the state of important properties like security, availability, and so on. You then rely on these fitness functions to evaluate each change and ensure they’re heading you towards your intended destination. These fitness functions are run in your deployment pipeline, removing the architect from the gatekeeper role, and instead allowing them to focus on guiding rather than enforcing.

Multiple dimensions refers to anything you want to ensure, such as security, data structure, reliability, latency, observability, etc — really any important properties. Each of these dimensions should have one or more fitness functions to support its guided change, or to “project that dimension.”

Fitness functions

The book defines fitness functions as follows: "An architectural fitness function provides an objective integrity assessment of some architectural characteristic(s)." Fitness is usually assessed in your deployment pipeline, preferably as unit tests that can be run locally for the fastest possible development loop.

Atomic vs holistic

Two varieties of fitness functions are atomic and holistic.

Atomic fitness functions "run against a singular context and exercise one particular aspect of the architecture." Sorbet asserting valid types on a codebase is an example of an atomic fitness function, as is asserting that a codebase has no incompatible dependencies.

Holistic fitness functions “run against a shared context and exercise a combination of architectural aspects such as security and scalability.” Ensuring that no personally identifying information (PII) is published into your logging system is a holistic fitness function. Monitoring that latency doesn’t increase with code changes would also be holistic.
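As a concrete sketch, an atomic fitness function can be as small as a unit test that enforces a dependency boundary. The package names and rule below are my own illustration, not from the book:

```python
# A minimal sketch of an atomic fitness function, written as a plain
# unit-test-style check: assert that no module in a hypothetical
# "billing" package imports from "frontend". Package names and the
# forbidden-edge rule are illustrative assumptions.
import ast

FORBIDDEN = {("billing", "frontend")}  # (package, disallowed import)

def imported_modules(source: str) -> set:
    """Return the set of top-level module names imported by source."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found

def check_boundaries(package: str, source: str) -> list:
    """Atomic fitness function: flag any forbidden imports."""
    violations = []
    for pkg, banned in FORBIDDEN:
        if package == pkg and banned in imported_modules(source):
            violations.append(f"{pkg} must not import {banned}")
    return violations

# A clean module passes; a boundary-crossing import fails.
assert check_boundaries("billing", "import json\n") == []
assert check_boundaries("billing", "from frontend import views\n") == [
    "billing must not import frontend"
]
```

Because it needs only a single codebase as context and exercises one architectural property, a check like this can run in the deployment pipeline or locally alongside ordinary unit tests.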

Triggered vs continual

Two additional dimensions of fitness functions are triggered and continual.

Triggered fitness functions “run based on a particular event, such as a developer executing a unit test.” These are verifications run on demand in your deployment pipeline, during local development and so on: linting, tests, fuzzing, coverage, etc.

Continual fitness functions "don't run on a schedule, but instead execute constant verification." This might be alerting on latency or ensuring infrastructure costs are trending towards budget.

Static vs dynamic

The taxonomy expands further with static and dynamic fitness functions.

Static fitness functions “have a fixed result, such as the binary pass/fail of a unit test.” Examples here are acceptable latency range, acceptable test coverage, unit tests passing, and so on.

Dynamic fitness functions “rely on a shifting definition based on extra context.” These would embody a tradeoff between say freshness and request volume, tolerating less freshness (and consequently more caching) for high volume infrastructure.

Automated and manual

The final distinction about fitness functions is between automated and manual fitness functions, which mean what you’d expect.

Designing fitness functions

You should identify fitness functions as early in a project’s lifecycle as possible, because they serve as a ratchet on quality degradation. However, you’re going to miss a bunch of potentially useful fitness functions since you don’t understand the system until you run it at scale. To account for that, meet periodically (at least annually, but that seems quite infrequent) to refresh and reevaluate your systems’ fitness.

Incremental change

When you’re in a service architecture, ensure that clients can upgrade at their own pace rather than assuming all clients will upgrade immediately and synchronously (they won’t). Also automate the deprovisioning of unused services and versions once they stop receiving traffic. (The pattern of ratcheting out adoption of old versions is particularly helpful, in my experience.)

The book recommends versioning services internally rather than having clients pass versions. The service fingerprints incoming requests and routes them to the correct implementation, while the endpoint that clients call doesn't change. "In either case, severely limit the number of supported versions," and in particular "strive to support only two versions at a time, and only temporarily."
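A minimal sketch of what that internal routing might look like, assuming hypothetical handler names and a shape-based fingerprint (the request fields are invented for illustration):

```python
# Internal service versioning sketch: clients call one stable endpoint,
# and the service fingerprints each request to route it to the right
# implementation. Handler and field names are illustrative assumptions.
def handle_v1(payload):
    # Legacy requests identified by the old "user" field.
    return {"id": payload["user"], "version": 1}

def handle_v2(payload):
    # Newer requests send "user_id" instead.
    return {"id": payload["user_id"], "version": 2}

def fingerprint(payload):
    """Infer the request version from its shape, not from the URL."""
    return 2 if "user_id" in payload else 1

# Keeping only two live versions matches the book's guidance to
# "strive to support only two versions at a time, and only temporarily."
HANDLERS = {1: handle_v1, 2: handle_v2}

def handle(payload):
    return HANDLERS[fingerprint(payload)](payload)

assert handle({"user": "abc"}) == {"id": "abc", "version": 1}
assert handle({"user_id": "abc"}) == {"id": "abc", "version": 2}
```

The design choice here is that the complexity of supporting old request shapes lives in one place, the service, rather than being pushed out to every client.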

Testability is essential for incremental change, supporting a rapid development loop. It also moves away from “strict development guidelines (with the attendant bureaucratic scolding).”

There are going to be tradeoffs between fitness functions over the course of development, and the point is to use them to have a structured conversation as early as possible. They're also a good safety brake, helping you realize when you're heading down an unacceptable path of tradeoffs.

Follow hypothesis-driven development, “rather than gathering formal requirements… leverage the scientific method instead.” This is a fascinating idea that reminds me of Escaping the Build Trap’s vision of product development.

Architectural coupling

Modularity is "a logical grouping of related code" and one of the most important tools to limit architectural coupling. Create modules out of related functionality to maximize functional cohesion. Aim to scope modules as an architectural quantum, "an independently deployable component with high functional cohesion."

Small modules are easier to change, so generally prefer smaller, but getting the right boundaries is key to balance between coupling and complexity.

There is a discussion of evolvability of different styles of codebases: big ball of mud, monolith, layered architecture, modular monoliths (e.g. monolith but with enforced modular boundaries), microkernel (“core system with an API that allows plug-in enhancements”), event-driven architectures, mediator pattern, service-oriented architecture, microservices, service-based architectures, and serverless. Which of these to pick is less important than designing the implementation well.

Evolutionary data

One of the heaviest frictions for evolving architecture is the underlying data, and that friction has inspired the practices of evolutionary data. This requires that schemas be (1) tested, (2) versioned, and (3) incremental. (I've personally found django-migrations to have good patterns to learn from here.)

It's ideal to have a shared-nothing architecture where applications don't directly integrate against the same database. If you do share one, consider the "expand/contract pattern," which allows you to support broader functionality, transition incrementally, and then remove the old behavior using a combination of code rewriting, code ratchets, and so on.
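A sketch of expand/contract applied to renaming a column, using an in-memory SQLite database for illustration (the table and column names are hypothetical; the point is that each step is safe to deploy on its own):

```python
# Expand/contract sketch: rename accounts.email to email_address
# without a big-bang migration. Table and column names are invented
# for illustration. Requires SQLite 3.35+ for DROP COLUMN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO accounts (email) VALUES ('a@example.com')")

# 1. Expand: add the new column alongside the old one, so old and new
#    code can coexist.
conn.execute("ALTER TABLE accounts ADD COLUMN email_address TEXT")

# 2. Transition: dual-write both columns in application code and
#    backfill existing rows, while readers migrate at their own pace.
conn.execute(
    "UPDATE accounts SET email_address = email WHERE email_address IS NULL"
)

# 3. Contract: once nothing reads or writes the old column (a code
#    ratchet can enforce this), drop it.
conn.execute("ALTER TABLE accounts DROP COLUMN email")

row = conn.execute("SELECT email_address FROM accounts").fetchone()
assert row == ("a@example.com",)
```

Each phase is its own deploy, which is what makes the schema change incremental rather than a coordinated cutover.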

The book also introduces the concept of inappropriate data coupling; for example, transactions force large architectural quanta. Anything within a transaction needs to be deployed with the other pieces contributing to that transaction. Transactions are also often owned by a database administration or infrastructure team, which introduces cross-team coordination costs as well.

This section also makes a great observation about how weak DBA tooling is: why are IDEs so good and DBA tools so poor? The authors blame vendors-as-religion behavior from DBAs, but blaming the lack of tools on DBAs being devoted to their vendors feels a bit reductive.

Building evolvable architectures

Tips for building evolvable architectures:

  • “Remove needless variability” through adoption of immutable infrastructure, long-lived feature flags, and so on.
  • “Make decisions reversible” by making it easy to undo deploys and such. Prefer immediately shifting traffic off a broken new version to slowly deploying a previous revision. Prefer flipping flags to disable new features over deployment, etc.
  • “Prefer evolvable over predictable.” If you optimize for the known challenges for an architecture, you’ll get stuck because there are at least as many unknown challenges as known challenges. It’s better to be able to respond to problems quickly than to cleanly address what you’re currently aware of.
  • “Build anticorruption layers.” Mostly this means building good interfaces so you can shift out the implementation underneath. The act of adding interfaces can be expensive as well, so balance this with finding the last responsible moment to make the decision.
  • “Build sacrificial architectures.” Assume that you’ll make tradeoffs that won’t last forever, and be okay with occasionally throwing away your implementations. The book uses the example of eBay rewriting from Perl to C++ to Java over the course of seven years.
  • “Mitigate external change.” For example, don’t rely on global package repositories, but instead pull in copies of packages you need locally. Then you can manage your upgrade timing in addition to owning your build pipeline reliability.
  • “Libraries versus frameworks.” Argues against frameworks, since you write code that integrates with frameworks, as opposed to writing code that calls out to libraries. Consequently, there is tighter coupling in frameworks.
  • “Version services internally.” Discussed earlier in these notes, don’t leak version identifiers to users, instead inspect the incoming requests and handle them appropriately. Easier for service to manage that complexity than all clients to handle it.
  • “Product over project.” Structure your teams, and consequently your architecture, around long-lived products, not short-lived projects.
  • “Dealing with external change.” Your clients (as in, software generating requests to your service) will change their needs over time; define explicit contracts to state these agreements, perhaps as Service Level Objectives.
  • “Culture of experimentation.” Evolution is built on structured, measured experimentation, which requires a structured, thoughtful approach.
  • Start with low-hanging fruit. Easy wins beget larger wins, so start where it’s easy. (I also had a chat earlier this week with Keith Adams, who described a power function for which files/systems are most frequently changed, so perhaps start where most changes happen.)

Books

Books recommended within Building Evolutionary Architectures:

  • Continuous Delivery
  • Lean Enterprise
  • Domain-Driven Design
  • Refactoring Databases

Final thoughts

This is a fantastic book, easily falling into the same rare category as Accelerate and A Philosophy of Software Design. I could easily imagine asking a team I was working with to read this book together and reflect on its practices. If you haven’t gotten a chance to spend time with it, it gets a strong recommendation from me.


Expanding on S[a-z]{3,} Reliability Engineer roles.

One of my foundational learning experiences occurred in 2014, when I designed and rolled out Uber’s original Site Reliability Engineering role and organization. While I’d make many decisions a bit differently if I could rewind and try again, for the most part I’m proud when reviewing the reel of rewound memories.

Folks will occasionally ask my advice on introducing SRE in their company, and I give them an answer they don’t expect: don’t. The one-word version comes across as rather more controversial than I intend. I love the approaches that define good SRE organizations, and I love the SREs I’ve worked with, but I believe that industry preconceptions of the role are sufficiently muddled that the term is actively unhelpful.

To grow a team using best SRE practices, skip the label “SRE.”

Label stuffing

When looking for an egregious example of label stuffing, it’s hard to find better than Long Island Iced Tea Corp renaming itself Long Blockchain Corp. It was an iced tea company before the name change, and it was an iced tea company after the name change, just an iced tea company that really wanted to be valued as a blockchain company. Thus far, only the SEC has valued that distinction.

Some engineering organizations have committed a similar maneuver, renaming their system administration groups as SREs. While the majority of those renames are done in good faith, following the scriptures of the good book, many of them don’t nail that transition. Perhaps a small subset of these rebrands are truly cynical, hoping to improve their hiring fortunes without changing practices, but my sense is that the vast majority are well-meaning folks struggling to land a difficult cultural transition.

Does the distinction matter?

This label stuffing is important because these two styles of work are incompatible, creating a dysfunctional or ineffective organization when applied in tandem. The two critical distinctions between systems administration and SRE organizations are how they (1) handle ownership and (2) create leverage.

Administration groups split ownership such that routine workflows cross organizational boundaries between them and their peer development groups: you write the code, I’ll deploy and operate the code. SRE groups split ownership in ways that ensure common workflows do not cross organizational boundaries: I’ll build the deployment platform, and you’ll use it to deploy.

These approaches to dividing ownership also impact how they create leverage for the organization they support. In most system administration organizations, the fundamental unit of progress is working hours. This is distinct from the software engineering and SRE teams I’ve worked with, where the fundamental unit of progress is working software.

When I joined Uber, services were provisioned manually through a sequence of highly coupled Puppet commits and deploys. When I left, services were provisioned by typing in the service name, clicking a button and waiting thirty seconds. No amount of additional human effort could have met the ramping business need for service provisioning, although certainly we could have kept trying harder and burned ourselves out.

“Just filter in the interview!”

Some folks will agree with the premise that the SRE label has become overly broad, but suggest that it’s easy to refine your hiring process to filter for the approach your organization prefers.

In general, this seems like the obvious approach, but I’ve found that the label stuffing exerts an ongoing, exhausting pressure on the interview loop itself. Folks get frustrated that the loop is filtering out great SREs they’ve worked with before, folks who were very successful in the SRE organization at another company, and this becomes evidence that your SRE loop is flawed.

With a very clear vision of how you want SRE to operate at your company, you can prevent erosion of evaluation, but why spend your life doing that when you can skip out on the label’s confused preconceptions entirely?

Descriptive teams

My approach for the last few years, as well as what I recommend to others, is to drop the SRE label entirely and hire software engineers. Do you abandon all hope of hiring folks with SRE expertise?

No! Move the specialization into the team’s mission instead of the role’s label. For example, create a team or group responsible for reliability and call that team the Resiliency team or the Reliability team. Advertise for software engineers to join that team to work on reliability, and hire those who you think will make your team successful, including those who have previously worked in successful SRE organizations.

But the inbound funnel…

The last concern I’ll hear from folks is that if they don’t use a frequent search term like SRE, they’ll miss out on folks who would only apply to an SRE role. My experience is that by posting a specific team job description, you’re freeing up time spent filtering to invest into sourcing and growing your organization’s public SRE brand.

I can easily imagine an organization that finds that “just turning on SRE” for their hiring increases inbound significantly, but my experience has been that it’s more of a wash in terms of your long-term goal of hiring folks who’ll be successful within your organization and align with your team’s approach.


Summing it all together: I’m not against anyone using the SRE label, I just think it’s more effective to avoid it at this point in the spin cycle. If it’s working really well for you, then by all means keep using it. As for me, I’ll be over here writing more specific team descriptions.


That's all for now! Hope to hear your thoughts on Twitter at @lethain!

