It is 2:13 AM and your phone is flashing and ringing. It is PagerDuty alerting you to a production issue. Technology has certainly evolved since my early days at Amazon Web Services where I had a physical pager :) but this experience is surely very familiar to many engineers developing and operating software in the modern world of SaaS.
We live in a world where 24/7 support is normal and users expect your system to be up and running flawlessly all the time. No one has the patience to wait an extra second for your site to load. This places very high expectations on the engineering teams building and operating such systems. As teams move fast to build new features and continuously deploy and change the production environment, failures are bound to happen. Yet, you need to maintain operational excellence to ensure your system is highly scalable and available despite constant change.
In this post, I will describe a 3-step framework to help you establish a culture of operational excellence in your organization. Failures in production are a fact of life, so treating them as learning opportunities is key to building a robust system and culture. This is the essence of building DevOps best practices in your organization. These are the 3 steps to employ when approaching operational excellence:
- DETECT: When do we know that an issue exists in our system?
- RECOVER: How quickly can we resolve the issue?
- PREVENT: How do we stop the issue from happening again?
Detect
Since failures will happen, the goal is to be notified of issues in your system before your customers are. This is the objective your organization needs to strive for. It means you must start with clear service level objectives (SLOs) for your key systems. A good SLO is customer-centric and specific. Here is a bad SLO example: “The UI needs to be responsive”. What does that even mean? What is considered responsive? Which UI? Here is a better example: “Loading the orders dashboard must not exceed 1 second at the 95th percentile (5-minute windows)”. This is more specific: the 95th percentile latency must be at or below 1 second. I can measure this and know if I am violating the objective. It is also customer-centric, as it describes something the user can see and gets value from.
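To make that concrete, here is a minimal sketch in Python of how you might check such an objective, assuming you already collect per-request latency samples with timestamps (the threshold and window size come from the example SLO above):

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60          # 5-minute windows, per the example SLO
P95_THRESHOLD_SECONDS = 1.0      # 1 second at the 95th percentile

def p95(samples):
    """Return the 95th percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

def windows_breaching_slo(requests):
    """requests: iterable of (unix_timestamp, latency_seconds) tuples.

    Groups samples into 5-minute windows and returns the windows whose
    p95 latency exceeds the SLO threshold.
    """
    by_window = defaultdict(list)
    for ts, latency in requests:
        by_window[int(ts) // WINDOW_SECONDS].append(latency)

    return {
        window: p95(samples)
        for window, samples in by_window.items()
        if p95(samples) > P95_THRESHOLD_SECONDS
    }
```

In practice, your metrics or observability platform computes these percentiles for you; the point is that the SLO translates directly into something measurable.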
From this, you can derive further metrics that you likely need to monitor (e.g. client-side vs. server-side latency, or service vs. database latency). It is hard to monitor every single detail, but when you focus your top SLOs on customer-centric interactions, you are more likely to detect customer-facing issues before most customers do. The most common SLOs are latencies for synchronous interactions, lag/delay for asynchronous interactions (where some queueing is involved), and error rates for availability tracking.
There is more to cover, so I will dedicate more posts to this later, but the essence can be summarized in these steps:
- Establish SLOs for major system scenarios (be customer-centric)
- Collect the needed data to track the proper metrics for your SLOs
- Define your thresholds based on the SLOs and set up automated alerts (see the sketch after this list)
- Schedule a (bi-)weekly operation review and observe trends
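As a rough illustration of the third step, here is a sketch of an alert check you might run on a schedule. The functions `query_p95_latency` and `page_on_call` are hypothetical stand-ins for your metrics store and paging integration, not real APIs:

```python
import time

SLO_P95_SECONDS = 1.0      # from the orders-dashboard SLO above
CHECK_INTERVAL = 60        # evaluate once per minute

def query_p95_latency(window_seconds: int) -> float:
    """Hypothetical: fetch the p95 dashboard-load latency (seconds)
    over the last `window_seconds` from your metrics store."""
    raise NotImplementedError

def page_on_call(message: str) -> None:
    """Hypothetical: trigger your paging integration (e.g. PagerDuty)."""
    raise NotImplementedError

def alert_loop():
    while True:
        p95 = query_p95_latency(window_seconds=5 * 60)
        if p95 > SLO_P95_SECONDS:
            page_on_call(
                f"Orders dashboard p95 latency {p95:.2f}s breaches the 1s SLO"
            )
        time.sleep(CHECK_INTERVAL)
```

Most monitoring systems let you express this as a declarative alert rule rather than a loop you run yourself, but the logic is the same: a threshold derived from the SLO, evaluated continuously, paging a human when breached.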
Recover
Once you get an alert or discover an issue, how long does it take you or your team to fix it and recover from it? This is your incident management process, which should be documented. The goal is to return the system to the nominal state that your customers expect. You know the system has recovered when you are no longer breaching your SLOs. If you have a service level agreement (SLA), then you are also obligated to respond to and resolve issues within specific timeframes. A good incident management process should include:
- Who should be notified and what is the escalation path
- How to assess severity and customer impact of issues
- Where to communicate and who needs to know about updates
- Resources to assist during the investigation
- How to get help if needed
The initial goal is to mitigate the issue, not necessarily to find the root cause. For example, if rolling back a recent change fixes the problem, then do not wait until you figure out the ultimate root cause. The customer does not need to wait while you figure that out, make the code change, review it, build it, then deploy it. Sometimes that is the only fix, but do not miss out on quicker mitigation options. You should have playbooks documented for common scenarios, for example, how to troubleshoot a service latency issue or how to add extra storage to your database. Provide links to the monitoring dashboards, hints about possible causes, and a list of commands or tools relevant to the troubleshooting process.
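One lightweight way to keep playbooks consistent is to treat each one as structured data with the same fields. The sketch below is only illustrative; the field names, URL, and example commands are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Playbook:
    """A minimal, illustrative playbook entry for a common scenario."""
    title: str
    dashboards: list[str]      # links to relevant monitoring dashboards
    likely_causes: list[str]   # hints to check first
    commands: list[str]        # commands/tools used during troubleshooting
    escalation: str            # who to pull in if mitigation stalls

service_latency_playbook = Playbook(
    title="Service latency above SLO",
    dashboards=["https://example.internal/dashboards/orders-latency"],
    likely_causes=[
        "Recent deployment (consider rolling back first)",
        "Database connection pool exhaustion",
        "Downstream dependency slowdown",
    ],
    commands=["kubectl rollout undo deployment/orders-api"],
    escalation="Secondary on-call, then the orders team lead",
)
```

Whether you keep playbooks in a wiki or in code, the consistency is what matters: the on-call engineer should always know where to find the dashboards, the likely causes, and the escalation path.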
It pays to establish common and consistent standards for logging, monitoring, documentation, tooling, etc. This will help the on-call person get to where the issue is and fix it as quickly as possible. A happy customer is good for business.
Prevent
This is the most critical step and what distinguishes a culture of great operational excellence from a mediocre one. This step has two sides, which can be summarized with these two questions:
- What can you do to prevent incidents from happening in the first place?
- What can you fix after an incident so it does not happen again?
The first question is often addressed with standard engineering best practices: unit and integration testing, code and design reviews, and progressive rollouts to production. The second question is about having an effective postmortem process where the root cause(s) are identified and action items are created to address them. Think of how the detection and recovery phases went during the incident and find ways to improve them. Maybe you were missing some alerts or needed better tools to assist in troubleshooting. Don’t forget to look at cultural root causes; not all root causes are technical in nature.
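To give a flavor of the first question, here is a small sketch of the kind of guardrail a progressive rollout might use: ramp traffic to a new version in stages and roll back if the canary’s error rate degrades. The stage percentages and the 1% error-rate cutoff are illustrative assumptions:

```python
ROLLOUT_STAGES = [1, 5, 25, 50, 100]   # percent of traffic on the new version
MAX_CANARY_ERROR_RATE = 0.01           # halt if errors exceed 1% (assumed)

def next_rollout_action(current_stage_pct: int, canary_error_rate: float) -> str:
    """Decide whether to advance, finish, or roll back a progressive rollout."""
    if canary_error_rate > MAX_CANARY_ERROR_RATE:
        return "roll back"                      # protect customers first
    if current_stage_pct >= ROLLOUT_STAGES[-1]:
        return "done"                           # fully rolled out
    next_pct = min(p for p in ROLLOUT_STAGES if p > current_stage_pct)
    return f"advance to {next_pct}%"
```

The specific tooling matters less than the principle: changes reach a small slice of customers first, and an objective signal decides whether they go further.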
There is a lot to discuss about how to run an effective postmortem, so I will dedicate a separate post to that.
Conclusion
To build a culture of operational excellence at your SaaS company, you should establish processes that embed these 3 steps (Detect, Recover, and Prevent) into your team’s DevOps mindset. Measure each of these steps, track how well you are doing, and keep improving continuously. Use every failure as an opportunity to get better.