When you start researching how to improve the reliability of your software, you will soon run into terms like SLO and SLA. They can sound intimidating, but they are quite straightforward to understand. In this post, we will introduce these terms, explain the differences between them and show how to start using them to make your systems more reliable.
SLO and SLA
An SLO (Service Level Objective) is a reliability target that a team wants to achieve for a service (e.g. 99.9% uptime for APIs). SLOs are set for internal teams. They are different from an SLA (Service Level Agreement), which the business team signs with customers and which usually comes with penalties that may be payable if the SLA targets are not met. If you have SLAs with your customers, your SLOs will be derived from them, and they are usually set a little stricter so that you have a buffer before an SLA is breached.
Why set SLOs?
Having clear SLOs makes it easier to make decisions about the reliability of your systems. Without them, you will not have the data you need to prioritise engineering effort between features and reliability-related tasks. SLOs also give all stakeholders a clear understanding of how their systems are performing.
Why your SLO targets should not be 100%
Your SLOs should be aligned with customer satisfaction. They should be set at a level at which your customers are satisfied with your product. The initial temptation for teams is to set SLO targets at 100%, e.g. website uptime should be 100%, or APIs should meet their target response time 100% of the time. You should avoid this, because 100% is not possible to achieve. If you set SLOs at 100%, all your engineering effort will be spent on making sure that your services never go down. This will slow down your feature velocity, and slower delivery of features will itself make customers unhappy.
The following table of nines can help you decide what your SLOs should be. Every extra nine that you add to an SLO has an additional cost associated with it, while most customers might never notice the difference.
| SLO % | Nines | Downtime per month (30 days) |
|---|---|---|
| 99 % | 2 nines | 7.2 hours |
| 99.9 % | 3 nines | 43.2 minutes |
| 99.99 % | 4 nines | 4.32 minutes |
| 99.999 % | 5 nines | 25.9 seconds |
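The downtime figures in the table follow from simple arithmetic, and it can be handy to recompute them for your own target or window. Here is a minimal sketch in Python. It assumes a 30-day month, and `allowed_downtime` is a hypothetical helper name used only for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime a window can absorb while still meeting the SLO.

    Assumes a 30-day month by default; pass another window if yours differs.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # Fraction of the window during which the service may be unavailable.
    unavailable_fraction = 1 - slo_percent / 100
    return window * unavailable_fraction


# Recompute the rows of the table of nines.
for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime(slo)}")
```

Passing a different `window`, such as `timedelta(days=28)`, gives the allowance for a 4-week window instead.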
Setting and measuring SLOs
To understand which parts of your system your SLOs should cover, think about what users care about most in your service. For example, if you are an email provider like Gmail, being available all the time is very important, so your SLOs should cover availability. If you run an online multiplayer game, users really care about the lag between their actions and the effect on the game, so latency-based SLOs will be very important.
You can check whether you are meeting your SLOs by looking at your SLIs. A Service Level Indicator (SLI) is a measurement of your service. Depending on what metric your SLO covers, you will have a corresponding SLI which calculates that metric. The type of SLI tells you which aspect of the service you are measuring. The different types are availability, latency, correctness, freshness, quality, coverage and durability.
Here are some of the aspects of your service that you can measure:
- Availability: This is one of the most popular SLIs because users really care about a service being available any time that they want to use it. The SLI here can be [(number of successful requests / number of total requests) * 100]. You can measure this using uptime monitoring services like Checkly.
- Latency: This covers the speed of your service and is an important one for many types of products. If your product is slow, you will start losing customers and revenue. An example SLO is that 99% of website page requests should load in less than 100 ms. You can measure latency using performance monitoring tools like Datadog.
- Quality: When your services are overloaded, you should still make the service available to users in a degraded state. The SLI here measures how many user interactions resulted in lower-quality responses. E.g. if Instagram is facing overload, it can decide to show images in lower resolution.
- Freshness: This metric measures the “recency” of the information accessed by the user. E.g. if your application updates report data only once an hour, then users could be looking at data that is up to an hour old when they check the reports.
- Correctness: If you have a service which takes in data and performs computations on it, then correctness measures the proportion of outputs that are correct for the given input.
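The availability formula and the latency example from the list above can be sketched in a few lines of Python. This is only an illustration: the `Request` record, its fields and both function names are assumptions made for the example, and in practice the numbers would come from your monitoring tool rather than an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one request; the field names are assumptions.
    succeeded: bool      # True when the request got a non-error response
    latency_ms: float    # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """(number of successful requests / number of total requests) * 100."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    successful = sum(1 for r in requests if r.succeeded)
    return successful / len(requests) * 100


def latency_sli(requests: list[Request], threshold_ms: float = 100.0) -> float:
    """Percentage of requests served in less than threshold_ms."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests) * 100


# Made-up sample: 97 fast successes, 2 slow successes, 1 fast failure.
sample = (
    [Request(True, 40.0)] * 97
    + [Request(True, 250.0)] * 2
    + [Request(False, 30.0)]
)
print(f"availability SLI: {availability_sli(sample):.2f}%")
print(f"latency SLI (<100 ms): {latency_sli(sample):.2f}%")
```

Comparing each SLI with its SLO target (say 99.9% for availability, or 99% for latency) then tells you whether the objective is being met.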
SLOs should be tracked over a time window, which can be a calendar window (e.g. 1 month) or a rolling window (e.g. the last 4 weeks). Shorter windows allow you to respond to SLO violations faster. This SLO data can help you make decisions about task prioritisation and resource allocation. It will also help you decide on the different aspects of your on-call schedules, like coverage and escalation times.
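To make the rolling window concrete, here is a minimal Python sketch that computes an availability SLI over the trailing 28 days from daily request counts. The `(good, total)` input shape and the name `rolling_availability` are assumptions for this example; your monitoring tool may expose the data differently.

```python
from collections import deque


def rolling_availability(daily_counts, window_days=28):
    """Yield (day_index, availability %) for the window ending on each day.

    daily_counts is an iterable of (good_requests, total_requests) per day.
    """
    window = deque()
    good_sum = 0
    total_sum = 0
    for day, (good, total) in enumerate(daily_counts):
        window.append((good, total))
        good_sum += good
        total_sum += total
        # Drop the oldest day once the window is longer than window_days.
        if len(window) > window_days:
            old_good, old_total = window.popleft()
            good_sum -= old_good
            total_sum -= old_total
        if total_sum:
            yield day, good_sum / total_sum * 100


# Made-up data: 27 perfect days, one bad day, then 28 more perfect days.
daily_counts = [(1000, 1000)] * 27 + [(900, 1000)] + [(1000, 1000)] * 28
slo_target = 99.9
for day, availability in rolling_availability(daily_counts):
    status = "ok" if availability >= slo_target else "SLO missed"
    print(f"day {day}: {availability:.3f}% {status}")
```

In this made-up data the single bad day keeps the rolling figure below the target until it drops out of the 28-day window, which is the main practical difference from a calendar window that resets at the start of each month.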
It's important that reliability becomes a team effort, so maintaining SLOs should not be the responsibility of development or operations team members alone. Sharing this responsibility will help you create a blameless culture in which team members are empowered to speak up and point out real problems in achieving your reliability goals.
I hope this post helps you get started with SLOs to make your systems more reliable. To understand how to set up alerting so you don't miss your SLOs and SLAs, sign up at Spike.sh or email us at [email protected]