Note: This is an older post from Medium that I've ported here.
It's come up a lot recently, and I want to put down a step-by-step guide on how you can set yourself up for success and then troubleshoot services during an outage. This is not an exhaustive list, but it's built from my own experiences.
There are a few items that I would consider absolute requirements for organizations that want to resolve issues quickly. If you already have these in place, you should be able to follow the steps. If you do not, they should be your next priority for maintenance work.
First and foremost is logging, specifically request and error logging. Transactional tracing is also nice to have, though unnecessary in most cases (if you can build it in such a way that you can turn it on during an outage, that's perfect).
In addition to traditional error logging, I would suggest enabling contract validation in your APIs: ensure that required parameters are submitted when needed, and if they are not, respond with the appropriate status code (400) and a message explaining the issue with the request (logging the same as a warning).
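As a concrete example, here is a minimal sketch of that kind of contract validation, assuming a Flask service with a hypothetical /orders endpoint and made-up required fields; the same idea applies in any framework:

```python
# Minimal contract-validation sketch (hypothetical endpoint and fields).
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger("orders-api")

REQUIRED_FIELDS = {"customer_id", "items"}  # illustrative contract only


@app.route("/orders", methods=["POST"])
def create_order():
    payload = request.get_json(silent=True) or {}
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        # Tell the caller exactly what was wrong, and log it as a warning.
        logger.warning("Rejected request: missing fields %s", sorted(missing))
        return jsonify({"error": f"missing required fields: {sorted(missing)}"}), 400
    # ...normal order handling would go here...
    return jsonify({"status": "accepted"}), 202
```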
Equally important to writing logs is collecting them; they do no good sitting on some server. There are a number of tools that can be used for log aggregation. CloudWatch (AWS's visibility tool) is a favorite of mine if you don't have too many AWS accounts. Otherwise there are free tools (the ELK stack) or paid ones (Splunk). If you have a large company with a large volume of logs to consume and aggregate, I would go with a paid product; otherwise you'll likely spend more on engineers optimizing an open-source package.
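Whichever aggregator you choose, structured (JSON) log lines are much easier to search and chart than free-form text. A minimal sketch using Python's standard logging module; the field names are purely illustrative:

```python
# Emit logs as JSON so CloudWatch, ELK, or Splunk can index the fields.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("request").info("GET /orders 200 12ms")
```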
Change control is vitally important. In my experience, outages are generally triggered by three kinds of events: a change to the system, a hardware failure, or a latent bug surfaced by some external trigger.
The first of these, a change, is the trigger the vast majority of the time (90% or more). It is therefore extremely important that every change to a production system is recorded and acknowledged by a responsible party. By responsible party, I mean someone meant to understand the change and the business need for it, not the person to hold responsible in the case of a problem.
Recording these changes in a centrally available place is an important part of change control, as it ensures that visibility exists across the organization in the case of an outage.
In order to start working on an outage, you need to learn about it. If the error happens right at the edge (as in, with your clients), they need an appropriate way to inform you of errors. To this end, I would ensure a link to error-reporting instructions remains on the screen at all times.
If your service starts reporting a bunch of errors, ensure that you have appropriate alerting so your on-call is notified (if your contracts support on-call). To support this, ensure that only actual errors are alerted on: 200-, 300-, and 400-level status codes are generally a non-issue, though getting a bunch of 400s can mean that your user experience is lacking.
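A sketch of that alerting rule, paging only when the server-side (5xx) error rate crosses a threshold; the status counts and threshold are placeholders for whatever your metrics pipeline actually provides:

```python
# Page only on server errors; 4xx volume belongs on a dashboard, not a pager.
def should_page(status_counts, threshold=0.05):
    total = sum(status_counts.values())
    if total == 0:
        return False
    server_errors = sum(count for status, count in status_counts.items() if status >= 500)
    return server_errors / total >= threshold


print(should_page({200: 950, 400: 45, 500: 5}))  # False: mostly fine, some 400s
print(should_page({200: 800, 503: 200}))         # True: 20% server errors
```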
Alerting should also be triggered by usage metrics. High CPU and memory usage can flag an issue, especially in environments without ephemeral compute (as in, servers that are kept around).
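A rough sketch of such a host-level check, assuming the psutil package is available; the thresholds are arbitrary and should be tuned to your own baselines:

```python
# Flag unusually high CPU or memory on a host (thresholds are illustrative).
import psutil

CPU_THRESHOLD = 85.0     # percent
MEMORY_THRESHOLD = 90.0  # percent


def resource_alerts():
    alerts = []
    cpu = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory().percent
    if cpu > CPU_THRESHOLD:
        alerts.append(f"high CPU: {cpu:.0f}%")
    if memory > MEMORY_THRESHOLD:
        alerts.append(f"high memory: {memory:.0f}%")
    return alerts
```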
The very first thing to do once you've got your trigger is to identify whether or not your specific app is the "root cause" of the outage. Depending on the trigger, you should take different steps.
Starting with the easiest (or most difficult) case: if you have high resource utilization and it's an appropriate time to have high usage, that's a message to either scale out or optimize your application. If it's not a normal high-usage time, check your request metrics. If your requests are high, it can be an outlier event; think about scaling out.
If your requests are not high, it's most likely one of two things. The first is a bug causing the issues, in which case review your recent changes and execute load tests to find the bug. The other is an upstream system taking too long to respond to requests.
If an upstream takes too long (say, five seconds when the norm is 100ms), your requests can get backed up, which can cause memory and CPU to go out of control. Finding this particular issue generally requires tracing tools such as X-Ray (an AWS offering), Datadog, or Dynatrace.
You can find that information manually, if your metrics are aggregated, by looking at load balancer request times for your upstreams, but that takes a relatively long time.
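If you are on AWS, a rough sketch of doing that by hand with boto3 and CloudWatch, pulling an Application Load Balancer's TargetResponseTime for the upstream; the load balancer dimension value here is a placeholder:

```python
# Pull the upstream ALB's response time over the last hour (placeholder names).
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/upstream-alb/1234567890abcdef"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 3), round(point["Maximum"], 3))
```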
If what you've got is a bunch of errors, simply look at the errors. If they are errors in calling upstreams or consuming upstream data, then look at that upstream. Otherwise, check your change records for recent changes.
If your clients are reporting errors and you do not see any errors in your logs, that generally indicates a failure in accessing your service. Ensure your service is actually running; if it is, the problem lies somewhere between the ingress to your network and your service. Having external health checks can make diagnosing such issues quicker.
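A minimal sketch of such an external check, run from outside your own network so a broken ingress is visible even when the service itself logs nothing; the URL is a placeholder:

```python
# Poll a health endpoint from the outside; a failure here with a healthy
# service points at the path between the ingress and the service.
import requests


def check_health(url="https://example.com/healthz", timeout=5):
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException as exc:
        print(f"health check failed: {exc}")
        return False
```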
If your ingress is entirely hardware, check the various connections on the way to your service. If you are using infrastructure as code, check for recent IaC pushes as well as all the connection points.
This triage acts as a reset point: if a troubleshooting path does not pan out, simply head back here and start over.
If you are reasonably confident that your system is the root cause, move to the next step.
Assuming the root cause is still not found, there are two directions you can go: upstream or downstream. If your clients are reporting errors that you do not see, head downstream.
Ideally, for each point in the network chain used to access your service, you will have the ability to get a shell at that point. Starting from the point of view of the client (as in, whatever tool they use), start testing your application. Assuming you can replicate their issue (if you can't, things get significantly more difficult), try to understand any patterns you may find based on the errors you receive.
You can then move up a point to the next network connection, for example a load balancer. If you log in to the same network segment as the load balancer and are now able to hit the service 100% of the time, you have identified the load balancer as the issue. Once you find the root of the issue, head on over to the next step.
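A sketch of the probe you might run from each point: the same loop executed from the client's network and then from the load balancer's segment makes the difference in success rate obvious. The URL is a placeholder:

```python
# Hit the service repeatedly and report the success rate from this vantage point.
import requests


def probe(url, attempts=50, timeout=3):
    successes = 0
    for _ in range(attempts):
        try:
            if requests.get(url, timeout=timeout).ok:
                successes += 1
        except requests.RequestException:
            pass
    return successes / attempts


# e.g. 0.6 from the client's side but 1.0 from behind the load balancer
# points at the load balancer (or the path in front of it).
print(probe("https://service.internal/healthz"))
```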
If the issue is upstream, the first thing you should do is contact the on-call that manages that upstream service. They can help you examine the logs and metrics of the upstream you are trying to hit, as well as understand the nuances of how that upstream operates.
If the upstream is reporting errors, the on-call of the upstream should restart this document from the triage step. If the upstream is not reporting any errors, then perform the same network tracing as in the downstream test, but going toward the upstream. Once you find the root, either you or the upstream on-call should go to the next step, depending on which area contains the issue.
The very first thing you should do is check whether a change was made recently to the root-cause system. If one was, and you are able, roll that change back. Do not roll forward unless rolling back is not possible, even though the first instinct of engineers will be to "just fix the problem real quick."
The reality is that folks don't normally push bugs to production on purpose. Pushing out a new change at a stressful time, such as during an outage, increases the likelihood of poor-quality changes making it through, which may cause even more problems.
Something that many miss as part of this change checking is two kinds of sneaky changes: changes to infrastructure, which can cause "hardware" or networking issues, and changes to dynamically loaded runtime configuration, such as database fields that control the operation of the system.
Ensuring those kinds of changes go through the same change control as code pushes will make it a ton easier to resolve issues. I cannot stress enough how often a non-code change has brought down critical systems at companies I've worked at.
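One way to get there is to route runtime configuration changes through a helper that writes an audit record alongside every change; config_store and audit_log below are hypothetical stand-ins for whatever storage you actually use:

```python
# Record who changed what, when, and why for every runtime config change.
from datetime import datetime


def update_config(config_store, audit_log, key, new_value, changed_by, reason):
    old_value = config_store.get(key)
    config_store[key] = new_value
    audit_log.append({
        "timestamp": datetime.utcnow().isoformat(),
        "key": key,
        "old_value": old_value,
        "new_value": new_value,
        "changed_by": changed_by,
        "reason": reason,  # the business need, reviewed like any other change
    })
```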
If no change was found, there are a couple of directions you can take. The first, and most likely, option is a hardware failure. Using the available logs and metrics, identify the problematic hardware and cordon it off at the load balancer (a sketch of this follows below) so you can operate on it safely. If the issue is something that can be easily fixed (such as a full disk drive), fix it and reintroduce the hardware to the load balancer, making sure to watch the logs. Otherwise, schedule a replacement of the hardware.
In the case of ephemeral compute, simply delete the offender so it is replaced with a new, hopefully working, copy. This can often be done automatically, which is why I normally suggest organizations architect for ephemeral compute. If the new copy does not start properly, the application's startup logs will likely contain an error that directs you to the problem.
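As a hedged sketch of the cordoning step on AWS with boto3 (the target group ARN and instance id are placeholders), pulling the bad host out of its target group before you touch it:

```python
# Take a suspect host out of the load balancer so it can be fixed or replaced safely.
import boto3

elbv2 = boto3.client("elbv2")

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service/abc123"
BAD_INSTANCE_ID = "i-0123456789abcdef0"

elbv2.deregister_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": BAD_INSTANCE_ID}],
)

# After fixing or replacing the host, register it again and watch the logs
# while it takes traffic:
# elbv2.register_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": BAD_INSTANCE_ID}])
```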
In the case that it's not a hardware failure and no change can be found, there are generally two possibilities. One is that there is a bug that wasn't found until some trigger caused it to surface. The other is that there was a change, either in the system or in an upstream, that wasn't recorded.
Either option will require engineers to examine the code surrounding the error, using the available logs and stack traces, to figure out the fix. This is the worst-case scenario, and the examination may reveal that an upstream has something to do with the outage. In that case, restart this document at the narrowing step.
Assuming a fix is identified, make sure that proper change controls are adhered to even during an outage. This is important as it may help one of your downstreams, and if your company is required to adhere to certain regulations (e.g., SOC 2), you are likely legally required to do so anyway.
Outages can be extremely confusing. It's important to have one person directing troubleshooting efforts. It should be clear to everyone taking part in the outage who this person is. In the case this person changes for whatever reason, this change should be clear to everyone involved.
Having up to date architectural diagrams of each application can be extremely helpful in identifying networking issues. These can be relatively invisible otherwise.