Confusing Terminology
The terms SRE, SLI, SLO, SLA, and Error Budgets have already become a lingua franca among people working in operations and distributed systems. Yet these exact words are used differently by different people, which blurs their meaning. Furthermore, knowing the terms does not by itself help one define a practical and applicable reliability approach, which is the core goal of Site Reliability Engineering (SRE).
In this series of letters, I will provide an intuitive approach to SRE terminology. I will also share my experience and the practices that worked effectively to upskill smaller organisations from "we don't monitor" to "we use reliability data for decision making". I will start by defining Service Level Indicators and discussing which of them you should collect, gradually moving into the organisational challenges you will face when adopting the Site Reliability Engineering approach.
Indicators
Before defining Service Level Indicators (SLIs), it's important to note what an indicator is. Unfortunately, there is no single correct definition.
An indicator could be a gauge or meter of a specified kind. A speed indicator shows the current speed of a car. A barometer indicates the air pressure.
An indicator could also be a test substance. For example, a pH indicator is a chemical compound added in small amounts to a solution so the pH of the solution can be determined.
Yet another type of indicator is a threshold value. A waterline indicates the point on the hull of a ship or boat to which the water rises.
Sounds confusing? It gets worse, because indicators can also be objective or subjective.
Subjective And Objective Indicators
Most indicators in the wild allow you to determine the value of a specific measurement. What is the current speed? What is the pH level? What is the water level? Those indicators are objective. The car's speed is either 60 km/h or 80 km/h; it cannot be both.
Not every indicator is objective. When you read a restaurant review, you look for particular indicators: Is the food tasty? Is it expensive? Is the restaurant close to other attractions? How long do you need to wait? The answers to those questions are very subjective.
To make matters worse, in many cases one can derive a subjective indicator from an objective one, and the other way around, simply by rephrasing the question:
What is the speed of the car? Objective. Is the car fast enough? Subjective. You may just want to reach your destination. You may also want to feel like a rockstar.
What is the average price of a meal in the restaurant? Objective. Is the average meal in the restaurant expensive? Subjective. Whether it feels expensive depends on the size of your wallet and how rich you feel today.
Service Level Indicators
Alright, there are multiple definitions of indicators, and to make matters worse, some are objective, others are subjective, and you can turn one into the other. What do we do with this mess?
We start reasoning from first principles. What is important? Why do we care about indicators to begin with?
The answer: we don't care about indicators. We care about users using our product. We care about our company's growth. We care about doing something good for other people. The things we truly care about are a) very subjective and b) measured across multiple dimensions. It would be nice to measure precisely how much good each particular service brings to our product with a single, unambiguous metric, but this is impossible.
From Goal To Indicator
However, it is possible to break a high-level goal, for example, user satisfaction, into multiple sub-goals, and then break those down further until we reach a measurable scale that you can put an indicator on. You likely want this indicator to be objective as well, to reduce the amount of ambiguity.
Can we measure user happiness? No, it's too abstract; we need to go deeper.
What makes our users happy? Many things. For example, users are happy when all the features they rely on work as expected. Can we measure that all the features work? Kind of, but not as a single "yes" or "no" question; we need to go deeper.
What exactly does "work as expected" mean? Users don't see many errors, and the latency is low. Can we measure error rate and latency? Yes! We have finally gotten deep enough to unravel an indicator.
We have just discovered that the error rate could be used as one of the many indicators of a user's happiness because it is an objectively measurable metric. But the job doesn't stop here:
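To make this concrete, here is a minimal sketch of how an error-rate indicator could be computed from raw request data. The `Request` record and the "5xx means failure" rule are assumptions made for illustration; your service may define an error differently.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int    # HTTP status returned to the user
    duration_ms: float  # time it took to serve the request

def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed, here defined as 5xx responses."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status_code >= 500)
    return failed / len(requests)

# Two failures out of five requests -> an error rate of 40%.
sample = [
    Request(200, 35.0), Request(200, 42.0), Request(500, 120.0),
    Request(200, 30.0), Request(503, 95.0),
]
print(f"error rate: {error_rate(sample):.0%}")  # error rate: 40%
```

The important property is that the number is objective: two people looking at the same requests will compute the same error rate.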
Many other indicators contribute to user happiness. Unravelling them will require returning to the question "What makes users happy?" and going down a different path, each resulting in an additional indicator.
After collecting multiple indicators, there are still a lot of nuances to consider. Some indicators may influence user happiness more than others, while some are just too hard to collect. There are tradeoffs everywhere.
The main analysis loop stays the same: Identify your key goal. For example, user satisfaction could be a good goal. Then, drill down from the goal until you hit an objective measurement. This objective measurement is your Service Level Indicator.
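As an illustration of a second pass through the same loop, here is a sketch of a latency indicator: the question "do requests feel fast?" drilled down to an objective nearest-rank percentile over observed request durations. The function name and the sample durations are made up for the example.

```python
import math

def latency_percentile(durations_ms: list[float], pct: float = 95.0) -> float:
    """Nearest-rank percentile of request durations, in milliseconds."""
    if not durations_ms:
        return 0.0
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request durations collected over some window, in milliseconds.
observed = [35.0, 42.0, 120.0, 30.0, 95.0, 48.0, 51.0, 2300.0, 40.0, 44.0]
print(f"p95 latency: {latency_percentile(observed):.0f} ms")  # p95 latency: 2300 ms
```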
Summary
Indicators can be objective or subjective, and the word itself can mean a variety of things. When thinking about Service Level Indicators, your goal should be to find objective metrics that can be tied directly to the end goal, such as user happiness. This focus on the end goal is what makes Service Level Indicators different from the other metrics you collect about your services.