SRE Metrics: Availability

Understanding SRE metrics and how they impact your platform's availability are fundamentals of Site Reliability Engineering.

Last updated 11 months ago

How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you calculate it? The chart below has numbers that every Site Reliability Engineer (SRE) should know. After the chart, you will find answers to these questions and the associated metrics.

Uptime       Downtime (Per Year)
99%          3 Days : 15 Hours : 39 Minutes
99.9%        8 Hours : 45 Minutes : 56 Seconds
99.99%       52 Minutes : 35 Seconds
99.999%      5 Minutes : 15 Seconds
99.9999%     31 Seconds
99.99999%    3 Seconds
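
The downtime figures in the chart can be reproduced with a few lines of Python. This is a minimal sketch rather than the method used to build the chart: the function name `downtime_per_year` is hypothetical, and it assumes a mean Gregorian year of 365.2425 days with results truncated to whole seconds, which appears to match the published numbers.

```python
from datetime import timedelta

# Assumption: mean Gregorian year, 365.2425 days * 86,400 seconds per day.
SECONDS_PER_YEAR = 31_556_952


def downtime_per_year(uptime_percent: float) -> timedelta:
    """Return the yearly downtime implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    unavailable_fraction = (100 - uptime_percent) / 100
    # Truncate to whole seconds (an assumption about how the chart rounds).
    return timedelta(seconds=int(SECONDS_PER_YEAR * unavailable_fraction))


for target in (99, 99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {downtime_per_year(target)} of downtime per year")
```

Changing the year length to 365 days shifts the results slightly, so published charts do not always agree to the second.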

Site Availability

There is a saying in the NFL that goes, “A player’s best ability is his availability.” The same thing is true for websites, applications, and platforms. You can have a great website or the “best” cloud platform, but if it is not available for your customers when they need it, then your business and your reputation will suffer.

Availability is everything, and it comes at a cost. It is achieved in many different ways, such as redundancy, load balancing, multiple data centers, and engineering response, to name a few. To calculate availability, we typically look at how long a service was unavailable during a specified period of time, taking into account planned maintenance and other planned downtime.
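
As an illustration of that calculation, here is a minimal sketch. It assumes one common convention, in which planned downtime is excluded from the measured period so that it does not count against availability. Your own service level agreement may define this differently, and the function and parameter names are hypothetical.

```python
from datetime import timedelta


def availability_percent(
    period: timedelta,
    unplanned_downtime: timedelta,
    planned_downtime: timedelta = timedelta(0),
) -> float:
    """Availability over a period, as a percentage.

    Assumption: planned downtime is subtracted from the period first,
    so only unplanned downtime lowers the result.
    """
    scheduled = period - planned_downtime
    if scheduled <= timedelta(0):
        raise ValueError("planned downtime must be shorter than the period")
    if not timedelta(0) <= unplanned_downtime <= scheduled:
        raise ValueError("unplanned downtime must fit within the scheduled time")
    uptime = scheduled - unplanned_downtime
    # Dividing two timedeltas yields a plain float ratio.
    return 100 * (uptime / scheduled)


# Example: a 30-day month with 2 hours of maintenance and a 43-minute outage.
month = availability_percent(
    period=timedelta(days=30),
    unplanned_downtime=timedelta(minutes=43),
    planned_downtime=timedelta(hours=2),
)
print(f"{month:.3f}%")
```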

Industry jargon refers to the number of “9’s” of availability. For instance, one 9 is 90%, while five 9’s is 99.999%.
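
The relationship between the count of nines and the percentage is 100 × (1 − 10^−n). A small sketch of that conversion follows; the function name `nines_to_percent` is hypothetical, and `Decimal` is used only to avoid floating-point noise in the printed result.

```python
from decimal import Decimal


def nines_to_percent(nines: int) -> Decimal:
    """Convert a count of nines (1 = 90%, 2 = 99%, ...) to a percentage."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # 100 * (1 - 10**-nines) simplifies to 100 - 10**(2 - nines).
    return Decimal(100) - Decimal(10) ** (2 - nines)


for n in range(1, 6):
    print(f"{n} nine(s): {nines_to_percent(n)}%")
```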

SRE Essential Metrics

USE Method

Utilization (% of time that the resource was busy)
Saturation (amount of work the resource has to do, often queue length)
Errors (count of error events)
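
To make the three resource metrics above concrete, the sketch below derives them from periodic samples of a single resource. The `ResourceSample` shape and the `summarize_resource` name are hypothetical. In practice these values usually come from your monitoring system rather than hand-built records.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One periodic observation of a resource (hypothetical shape)."""
    busy: bool         # was the resource doing work when sampled?
    queue_length: int  # work items waiting for the resource
    errors: int        # error events since the previous sample


def summarize_resource(samples: list[ResourceSample]) -> dict[str, float]:
    """Summarize utilization, saturation, and errors over the samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    count = len(samples)
    return {
        # Share of samples in which the resource was busy.
        "utilization_percent": 100 * sum(s.busy for s in samples) / count,
        # Average queue length, used here as the saturation signal.
        "saturation_avg_queue": sum(s.queue_length for s in samples) / count,
        # Total error events across the window.
        "errors_total": float(sum(s.errors for s in samples)),
    }


samples = [
    ResourceSample(busy=True, queue_length=0, errors=0),
    ResourceSample(busy=True, queue_length=2, errors=1),
    ResourceSample(busy=True, queue_length=4, errors=0),
    ResourceSample(busy=False, queue_length=2, errors=0),
]
print(summarize_resource(samples))
```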

RED Method

Rate (the number of requests per second)
Errors (the number of those requests that are failing)
Duration (the amount of time those requests take)
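
The three request metrics above can be computed from a window of request records, as in the following sketch. The `RequestRecord` shape and the `red_metrics` name are hypothetical, and the median is just one reasonable way to summarize duration. Many teams track percentiles such as p95 or p99 instead.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """One served request (hypothetical shape)."""
    duration_seconds: float
    failed: bool


def red_metrics(requests: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Compute rate, errors, and duration for requests seen in a time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    durations = [r.duration_seconds for r in requests]
    return {
        # Rate: requests per second over the window.
        "rate_per_second": len(requests) / window_seconds,
        # Errors: how many of those requests failed.
        "errors": float(sum(r.failed for r in requests)),
        # Duration: median time taken, or 0.0 when there was no traffic.
        "duration_median_seconds": median(durations) if durations else 0.0,
    }


window = [
    RequestRecord(duration_seconds=0.1, failed=False),
    RequestRecord(duration_seconds=0.2, failed=False),
    RequestRecord(duration_seconds=0.3, failed=True),
    RequestRecord(duration_seconds=0.4, failed=False),
]
print(red_metrics(window, window_seconds=2))
```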

The Four Golden Signals

Latency (time taken to serve a request)
Traffic (how much demand is placed on your system)
Errors (rate of requests that are failing)
Saturation (how “full” your service is)

History of SRE

Conclusion

Useful Resources
