SRE & AIOps Glossary

Key terms and concepts in AI-powered site reliability engineering, incident management, and modern DevOps — explained clearly.

A

AIOps

Artificial Intelligence for IT Operations: the use of AI and machine learning to automate and enhance IT operations tasks such as event correlation, anomaly detection, and root cause analysis. AIOps platforms ingest data from multiple monitoring tools and use ML models to reduce noise and surface actionable insights.

Alert Fatigue

The desensitization that occurs when engineers are overwhelmed by excessive, noisy, or low-priority alerts. Alert fatigue leads to slower response times, missed critical incidents, and increased burnout. AI SRE platforms combat this by filtering false positives and delivering cause-based, context-rich alerts, often reducing alert volume by 90% or more.

B

Blameless Postmortem

A structured review conducted after an incident that focuses on understanding what went wrong and how to prevent recurrence, without placing personal blame. Blameless postmortems promote honest reporting, improve institutional learning, and are a keystone of healthy SRE culture.

C

Change Failure Rate (CFR)

The percentage of deployments or changes that result in a degraded service or need remediation (rollback, hotfix, or patch). CFR is one of the four DORA metrics used to measure software delivery performance. A lower CFR indicates a stable and reliable release process.
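The calculation is straightforward; a minimal sketch (the deployment counts are invented for illustration):

```python
def change_failure_rate(failed: int, total: int) -> float:
    """CFR as a percentage: failed changes divided by total changes."""
    if total == 0:
        return 0.0  # no deployments, no failures to attribute
    return 100 * failed / total

# 3 rollbacks/hotfixes out of 40 deployments this quarter
print(change_failure_rate(3, 40))  # 7.5
```

DORA benchmarks typically bucket CFR into ranges (e.g., elite performers at 0-15%), so the raw percentage is what feeds that classification.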

E

Error Budget

The maximum amount of unreliability a service can tolerate within a given period, derived from its Service Level Objective (SLO). For example, a 99.9% availability SLO allows ~43 minutes of downtime per month. When the error budget is exhausted, teams typically freeze feature releases and focus on reliability improvements.
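The ~43-minute figure falls directly out of the SLO arithmetic; a quick sketch, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the given period."""
    total_minutes = period_days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)

print(error_budget_minutes(0.999))   # ≈ 43.2 minutes per month
print(error_budget_minutes(0.9999))  # ≈ 4.3 minutes per month
```

Each added "nine" of availability shrinks the budget by a factor of ten, which is why the cost of reliability grows steeply near 100%.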

I

Incident Response

The structured process of detecting, triaging, mitigating, and resolving production incidents. Modern incident response comprises defined roles (Incident Commander, Communications Lead), runbooks, and communication channels. AI SRE accelerates every phase, from automatic detection and root cause analysis to suggested remediation steps.

M

Mean Time Between Failures (MTBF)

The average time elapsed between one system failure and the next. MTBF measures the reliability of a system's operation without interruption. A higher MTBF indicates greater system stability. It is calculated as total uptime divided by the number of failures in a given period.
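The formula stated above can be written directly (the uptime and failure figures below are invented):

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """MTBF = total uptime divided by the number of failures in the period."""
    return total_uptime_hours / failure_count

# 720 hours (30 days) of uptime with 3 failures in the period
print(mtbf_hours(720, 3))  # 240.0 hours between failures
```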

Mean Time To Detect (MTTD)

The average time it takes to discover that an incident or issue has occurred, measured from the moment the problem begins to when it is first identified. Reducing MTTD is critical because every minute of undetected failure extends user impact. AI-powered monitoring can shrink MTTD to near zero through continuous anomaly detection.

Mean Time To Resolution (MTTR)

The average time from when an incident is detected to when it is fully resolved and service is restored. MTTR is a key reliability metric. Traditional SRE teams average ~3.5 hours, while AI SRE platforms such as Sherlocks.ai reduce this to ~22 minutes through automated root cause analysis and guided remediation.
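MTTD and MTTR fall out of the same incident timeline: when the problem began, when it was detected, and when it was resolved. A minimal sketch with invented timestamps:

```python
from datetime import datetime, timedelta

# Each incident: (began, detected, resolved) — illustrative data only
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 12), datetime(2024, 5, 1, 11, 0)),
    (datetime(2024, 5, 3, 2, 0),  datetime(2024, 5, 3, 2, 4),   datetime(2024, 5, 3, 2, 40)),
]

def mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([detected - began for began, detected, _ in incidents])
mttr = mean([resolved - detected for _, detected, resolved in incidents])
print(mttd)  # 0:08:00 — mean of 12 min and 4 min to detect
print(mttr)  # 0:42:00 — mean of 48 min and 36 min to resolve
```

Note the two metrics chain together: total user-facing impact per incident is roughly MTTD + MTTR, which is why shrinking either one matters.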

O

Observability

The ability to understand the internal state of a system by examining its external outputs, primarily logs, metrics, and traces (the "three pillars"). Observability goes beyond traditional monitoring by empowering engineers to ask arbitrary questions about system behavior without deploying new instrumentation. AI SRE enhances observability by automatically correlating signals across all three pillars.

On-Call

A rotation schedule where engineers are designated as first responders for production incidents outside of normal working hours. On-call responsibilities include acknowledging alerts, triaging issues, and driving resolution. AI SRE reduces on-call burden by handling initial triage, providing root cause analysis, and filtering out non-actionable alerts before they page engineers.

R

Root Cause Analysis (RCA)

The systematic process of identifying the underlying cause of an incident rather than just its symptoms. Traditional RCA can take hours of manual investigation across logs, metrics, and deployment history. AI SRE automates RCA by correlating signals across infrastructure, application code, and recent changes to determine the root cause in seconds.

Runbook

A documented set of step-by-step procedures for handling specific operational tasks or incidents. Runbooks standardize responses to common scenarios (e.g., database failover, scaling events) so any on-call engineer can follow them. AI SRE platforms can automatically suggest or execute relevant runbook steps during incidents.

S

Service Level Agreement (SLA)

A formal contract between a service provider and its customers that defines the expected level of service, including uptime guarantees, response times, and remedies for breaches (e.g., service credits). SLAs are backed by internal SLOs and SLIs that measure whether commitments are being met.

Service Level Indicator (SLI)

A quantitative measure of a specific aspect of a service's performance, such as request latency, error rate, or throughput. SLIs are the raw metrics that feed into Service Level Objectives (SLOs). Choosing the right SLIs is critical; they should reflect what users actually experience.

Service Level Objective (SLO)

A target value or range for a Service Level Indicator (SLI) that defines the desired reliability of a service, for example, "99.95% of requests complete in under 200ms." SLOs balance reliability with development velocity: too aggressive and teams can't ship features; too lenient and users suffer.
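Checking an SLI against an SLO target is just a ratio comparison; a sketch using the example target above, with invented request counts:

```python
def slo_compliance(good_events: int, total_events: int, slo: float) -> bool:
    """True if the measured SLI (good/total) meets or exceeds the SLO target."""
    sli = good_events / total_events
    return sli >= slo

# 99,962 of 100,000 requests completed in under 200ms, against a 99.95% SLO
print(slo_compliance(99_962, 100_000, 0.9995))  # True  (SLI = 99.962%)
print(slo_compliance(99_900, 100_000, 0.9995))  # False (SLI = 99.900%)
```

This event-based ("good events / total events") framing is the usual way SLOs are evaluated in practice, since it ties the target directly to an SLI.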

Site Reliability Engineering (SRE)

A discipline pioneered by Google that applies software engineering principles to infrastructure and operations. SRE teams are responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of production services. SRE treats operations as a software problem and uses automation to eliminate toil.

T

Toil

Manual, repetitive, automatable, tactical work that scales linearly with service growth and provides no lasting value. Examples include manually restarting services, hand-editing configurations, or running routine maintenance scripts. A core goal of SRE is to reduce toil below 50% of an engineer's time so they can focus on engineering work that improves reliability at scale.