SRE Metrics: Availability

Understanding SRE metrics and how they impact your platform's availability is fundamental to Site Reliability Engineering.


Last updated 11 months ago


How available is your website, service, or platform? What must you monitor and measure to ensure availability? How do you translate uptime into availability? This chart has numbers that every Site Reliability Engineer (SRE) should know. Below the chart, you will find answers to commonly asked questions about SRE and associated metrics.

Uptime        Downtime (Per Year)
99%           3 Days : 15 Hours : 39 Minutes
99.9%         8 Hours : 45 Minutes : 56 Seconds
99.99%        52 Minutes : 35 Seconds
99.999%       5 Minutes : 15 Seconds
99.9999%      31 Seconds
99.99999%     3 Seconds
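The downtime figures in the chart can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `downtime_per_year` is made up for this example, and it assumes an average year of 365.2425 days, which appears to match the chart's numbers (a flat 365-day year gives slightly smaller values).

```python
from datetime import timedelta

# Assumption: an average Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60


def downtime_per_year(uptime_percent: float) -> timedelta:
    """Return the yearly downtime allowed by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    unavailable_fraction = (100 - uptime_percent) / 100
    return timedelta(seconds=unavailable_fraction * SECONDS_PER_YEAR)


# Print the allowed downtime for a few common uptime targets.
for uptime in (99, 99.9, 99.99, 99.999):
    print(f"{uptime}% -> {downtime_per_year(uptime)}")
```

The same function works for any target, so you can check what an uptime commitment such as 99.95% would allow before agreeing to it.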

Site Availability

There is a saying in the NFL that goes, “A player’s best ability is his availability.” The same thing is true for websites, applications, and platforms. You can have a great website or the “best” cloud platform, but if it is not available for your customers when they need it, then your business and your reputation will suffer.

In this day and age, availability is everything, and it comes with a cost. Availability is achieved in many different ways, such as redundancy, load balancing, multiple data centers, and engineering response, to name a few. To calculate availability, we typically look at how long the service was unavailable during a specified period of time, taking into account planned maintenance and other planned downtime.
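As a rough sketch of that calculation, the function below treats planned downtime as time the service was never expected to be up and measures unplanned downtime against what remains. That is only one common convention and is an assumption here. Some teams count planned maintenance against availability instead. The function name and the sample numbers are hypothetical.

```python
def availability_percent(period_hours: float,
                         unplanned_downtime_hours: float,
                         planned_downtime_hours: float = 0.0) -> float:
    """Availability over a period, excluding planned downtime from the
    time the service was expected to be up (an assumed convention)."""
    expected_uptime = period_hours - planned_downtime_hours
    if expected_uptime <= 0:
        raise ValueError("planned downtime must be shorter than the period")
    if not 0 <= unplanned_downtime_hours <= expected_uptime:
        raise ValueError("unplanned downtime must fit within the expected uptime")
    return 100 * (expected_uptime - unplanned_downtime_hours) / expected_uptime


# Hypothetical 30-day month: 4 hours of planned maintenance, 43 minutes of outage.
print(round(availability_percent(30 * 24, 43 / 60, 4), 3))
```

Whichever convention you choose, apply it consistently so that month-to-month numbers stay comparable.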

Industry jargon refers to the number of “9’s” related to availability. For instance, one 9 would be 90%, while five 9’s would be 99.999%.
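To make the "9's" shorthand concrete, here is a small sketch that converts a count of nines into the matching percentage. The helper name `nines_to_percent` is made up for this example, and it uses `Decimal` only to avoid floating-point noise in the printed values.

```python
from decimal import Decimal


def nines_to_percent(nines: int) -> Decimal:
    """Convert a number of nines (1, 2, 3, ...) to an availability percentage."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # The unavailable share is 10^(2 - nines) percent: 10% for one nine,
    # 1% for two nines, 0.1% for three nines, and so on.
    return Decimal(100) - Decimal(10) ** (2 - nines)


for n in range(1, 6):
    print(n, "nines ->", f"{nines_to_percent(n)}%")
```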

SRE Essential Metrics

Metrics have become the lifeblood of many organizations. Deciding what to monitor and what not to monitor can be just as important as the monitoring tools themselves (Prometheus, Grafana, and similar systems). In many instances, there can be an overwhelming urge to gather metrics on every available function, potentially leading to information overload. To keep monitoring manageable and actionable, consider the following methods when determining your needs.

For hardware-related monitoring, consider the USE Method.

  • Utilization (% of time that the resource was busy)

  • Saturation (amount of work the resource has to do, often queue length)

  • Errors (count of error events)

For services-related monitoring, consider the RED Method.

  • Rate (the number of requests per second)

  • Errors (the number of those requests that are failing)

  • Duration (the amount of time those requests take)

For Kubernetes-related monitoring of services, consider the Four Golden Signals.

  • Latency (time taken to serve a request)

  • Traffic (how much demand is placed on your system)

  • Errors (rate of requests that are failing)

  • Saturation (how “full” your service is)

Tom Wilkie of Grafana Labs gave a great talk on these at GrafanaCon EU 2018. For more information on these methodologies, watch the talk or check out this article by Grafana Labs.
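As a rough illustration of the RED method (rate, errors, duration), the sketch below summarizes a window of request records. The `Request` record shape and the `red_metrics` function are hypothetical and exist only for this example. In practice a monitoring system such as Prometheus would typically compute these from counters and histograms rather than from raw request lists, and the choice of a 95th percentile for duration is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_ms: float
    failed: bool


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize a window of requests as Rate, Errors, and Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile, using integer math to avoid rounding issues.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_second": total / window_seconds,
        "errors_per_second": failures / window_seconds,
        "duration_p95_ms": p95,
    }


# Hypothetical sample: four requests observed over a two-second window.
sample = [Request(100, False), Request(200, False), Request(300, True), Request(400, False)]
print(red_metrics(sample, window_seconds=2.0))
```

Reporting a percentile rather than an average for duration keeps a handful of slow requests from hiding behind many fast ones.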

History of SRE

Site Reliability Engineering (SRE) dates back to 2003, when Google assigned a team of software engineers to design a concept that would make certain Google websites were efficient, scalable, and reliable. The concepts they used were so successful that other technology companies, like Netflix and Amazon, began using similar concepts as well as improving upon them. In short order, SRE became its own discipline within the IT architecture domain. SRE is meant to work in concert with DevOps but focuses on such things as capacity planning and disaster recovery and response. Ultimately, SRE focuses on the automation of operations, endeavoring to remove the human element so that sites, applications, and platforms can be optimized.

Conclusion

Understanding how availability impacts the delivery of your chosen platform starts with knowing what those numbers look like. For instance, the difference between two 9’s and five 9’s is the difference between days and minutes of downtime per year. Therefore, choosing the proper observability methodology, such as RED, USE, or the Four Golden Signals, will help you deliver high availability for your specific service. A good starting point to help you define your SRE operations is Google’s guide to SRE Operations.

Useful Resources

  • Grafana Labs
  • Site Reliability Engineering (SRE) dates back to 2003
  • Google’s guide to SRE Operations
  • Tom Wilkie’s GrafanaCon EU 2018 Presentation Slides
  • Evolution of the Site Reliability Engineer
  • SRE Availability Metrics (PDF)