Site Reliability Engineering (SRE)

Modern web applications need constant attention and improvement, even when they are serving little traffic. This includes monitoring performance, reviewing the application’s structure, and keeping application defects in check.
Previously, these activities were handled by front-line developers and system administrators, but they often lagged behind or left gaps. Site Reliability Engineering closes those gaps by improving availability, efficiency, capacity planning, and monitoring.

Why SRE over DevOps

DevOps is mainly about streamlining development and operations to build a robust product, whereas SRE is the practice of creating and maintaining a highly resilient service. DevOps focuses primarily on automation; SRE focuses on the stability and scalability of the production environment, as well as on observability.

What is SRE

SRE allows software engineers to own the day-to-day operations of the application in the production environment. It covers practices such as real-time monitoring of applications and services, alerting, and development practices that automate operations and improve the system’s health and availability.
SRE unites development and operations by combining software engineering with systems work to build a highly productive system. It is the practice of creating and maintaining a highly resilient service, with a focus on the stability of the production environment, observability, and reliability at scale.

Key SRE Capabilities

  • Monitoring and reviewing application performance stats
  • Enabling diagnostics for key performance issues
  • Log indexing and pattern analytics
  • Isolating defects and feature requests

SRE Attributes

SLI

Service Level Indicator (A measured value that indicates the health of a service.)

SLO

Service Level Objective (A target for an SLI, used to keep track of it.)

SLA

Service Level Agreement (A business agreement that commits to a level of service.)

Error Budget

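An error budget states, as a number, how much unavailability the agreed availability target allows. For example, a 99.9% availability target leaves an error budget of 0.1%.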
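The relationship between an SLI, its target, and the error budget can be sketched in a few lines of Python. This is a minimal illustration, not part of any product: the function names, the 99.9% target, and the request counts are assumptions chosen for the example, and the budget is computed against whatever availability target is passed in.

```python
import math


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_minutes(target: float, window_minutes: int) -> float:
    """Allowed minutes of unavailability in the window for a given target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - target) * window_minutes


def budget_remaining(target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_failures = (1.0 - target) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target over a 30-day window.
    minutes = error_budget_minutes(0.999, 30 * 24 * 60)
    print(f"Allowed downtime: {minutes:.1f} minutes")  # about 43.2 minutes
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4f}")
    print(f"Budget left: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

With these example numbers, 500 failed requests out of one million consume about half of the budget that a 99.9% target allows.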

SRE (SLI/SLO) Monitoring using Cavisson Monitoring Suite

Insight into application performance through SLO-focused metrics in an interactive dashboard.

  • CPS/CPM and errors for each service
  • Error, latency, and throughput stats, along with the time taken by integration point calls
  • Insight into infrastructure health through disk and system load stats, to identify potential issues
  • Drill-down to individual requests for detailed insight and RCA
  • Synthetic monitoring to check the real-time availability of applications

SLO-driven real-time alerts

  • Trigger alerts with manual or dynamic thresholds for different severity states
  • Drill down from alerts for detailed insight and RCA
  • Configure alerts across different metrics, including percentile, rate, and custom metrics
  • Track uptime and system/application health with the geo map and health dashboard
  • Identify patterns and trends in behavior, and correlate them to assess the ongoing viability of SLOs
  • Monitor business performance using the business KPI dashboard (orders, revenue, cart)
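To illustrate the difference between manual and dynamic thresholds mentioned above, here is a small Python sketch. It is not the Cavisson alerting API: the severity names, the `classify` and `dynamic_thresholds` helpers, and the choice of "mean plus k standard deviations" as the dynamic rule are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Severity states checked from most to least severe (hypothetical names).
SEVERITY_ORDER = ("critical", "major", "minor")


def classify(value: float, thresholds: dict) -> str:
    """Return the most severe state whose threshold the value meets or exceeds."""
    for severity in SEVERITY_ORDER:
        bound = thresholds.get(severity)
        if bound is not None and value >= bound:
            return severity
    return "normal"


def dynamic_thresholds(baseline: list, multipliers: dict = None) -> dict:
    """Derive thresholds from recent samples as mean + k * standard deviation."""
    if multipliers is None:
        # Assumed multipliers; tune them to the metric being watched.
        multipliers = {"minor": 2.0, "major": 3.0, "critical": 4.0}
    mu = mean(baseline)
    sigma = pstdev(baseline)
    return {severity: mu + k * sigma for severity, k in multipliers.items()}


if __name__ == "__main__":
    # Manual thresholds on response time in milliseconds (example values).
    manual = {"minor": 500, "major": 800, "critical": 1200}
    print(classify(950, manual))  # major

    # Dynamic thresholds computed from a short, made-up baseline.
    learned = dynamic_thresholds([100, 110, 90, 105, 95])
    print(classify(125, learned))
```

Manual thresholds suit metrics with a known acceptable range, while a baseline-derived rule like this one adapts to metrics whose normal level drifts over time.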

Key metrics to focus on

  • Health and performance metrics
  • SLO violation duration graph
  • Session duration
  • Business transaction response time/load
  • Error rate
  • Batch latency
  • Throughput
  • Counts of cache hits
  • Database response time
  • Real-time performance
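
Several of the metrics listed above (error rate, latency, throughput) can be derived from raw request records. The following Python sketch shows one way to do that; the `Request` record, its field names, and the nearest-rank percentile method are assumptions for illustration rather than how any particular monitoring tool computes them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record layout)."""
    latency_ms: float
    ok: bool


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: list, window_seconds: float) -> dict:
    """Compute error rate, throughput, and p95 latency for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "error_rate": errors / total if total else 0.0,
        "throughput_rps": total / window_seconds,
        "p95_latency_ms": (
            percentile([r.latency_ms for r in requests], 95) if total else None
        ),
    }


if __name__ == "__main__":
    # Made-up sample: 20 requests in a 10-second window, one of them failed.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 20)) for i in range(1, 21)]
    print(summarize(sample, window_seconds=10))
```

Percentiles are used here instead of an average because a small number of very slow requests can hide behind a healthy-looking mean.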