At 14:01 PDT on Friday, April 7th, our monitoring indicated that our Public API request rate had dropped and health checks were failing. The situation deteriorated rapidly: some of our API endpoints became unresponsive, which impacted the availability of the PandaDoc platform.
Following our protocol, we immediately started our incident response procedure, rolled back recent updates, and assigned engineers to multiple investigation paths. After dismissing some initial theories, we determined that the issue was at the infrastructure level and began investigating it together with our cloud provider (AWS).
After a deep investigation lasting several hours, we tracked the issue down to a network problem: several pods on a specific Kubernetes node were experiencing intermittent low-level network issues that caused connection leaks - the pods repeatedly opened connections without closing them, or closed only some of them. This eventually increased latency and memory consumption, and some of our core services entered a chain of crashes. As a consequence, the Application and API were unavailable during the downtime.
Once the root cause was identified, the faulty machine was removed from the cluster and the system resumed normal operation. The issue was fully resolved by 01:23 PDT on April 8th.
When the incident started, we noticed a spike in the number of connections in our database pool, with many API calls waiting for connections to be released before they could process incoming requests. We quickly determined that connections were not being released because of a large number of uncommitted transactions sitting idle. We then started analyzing database locks and deadlocks, since they are the usual cause of this behavior, and wrote a hotfix for one of our API endpoints to reduce the number of processed events, expecting this to release connections faster.
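The dynamic here can be sketched with a toy bounded pool (a hypothetical simulation, not our actual stack): when handlers hold their connection through a transaction that never commits, new requests find the pool empty and queue up waiting.

```python
import queue
import threading

# Toy bounded connection pool: a handler that holds its connection through
# an idle, uncommitted transaction starves later requests (hypothetical
# sketch, not production code).
POOL_SIZE = 2
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

stalled = threading.Event()  # stands in for a transaction that never commits

def slow_handler():
    conn = pool.get()           # take a connection from the pool
    stalled.wait(timeout=0.5)   # "uncommitted transaction" sitting idle
    pool.put(conn)              # released only after the stall clears

threads = [threading.Thread(target=slow_handler) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()

# A new request now finds the pool empty and has to wait:
try:
    pool.get(timeout=0.1)
    pool_exhausted = False
except queue.Empty:
    pool_exhausted = True

print("pool exhausted:", pool_exhausted)  # -> pool exhausted: True

stalled.set()  # the transactions finally commit or roll back
for t in threads:
    t.join()
print("free connections after release:", pool.qsize())  # -> 2
```

This is why reducing the number of processed events looked promising at first: fewer events per request should mean each connection is returned to the pool sooner.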
Soon after, we understood that the database was not the bottleneck, although stalled transactions were still growing and connections in the pool were still being taken and not released. A deeper analysis of API endpoint metrics revealed that external calls made within transactions could be the culprit. After more investigation, we found a similarity among the unresponsive API calls: they all interacted with our message queue (a RabbitMQ HA cluster).
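An external call made while a database transaction is open can hang that transaction for as long as the call hangs. A common mitigation, sketched below with hypothetical function names (this is not our actual code), is to bound the external call with a timeout so the request fails fast instead of holding its connection indefinitely:

```python
import concurrent.futures
import time

def publish_to_queue(event):
    """Stand-in for a publish against a hung broker (hypothetical)."""
    time.sleep(1.0)  # the broker never answers in time
    return "ok"

def handle_event(event, timeout=0.2):
    """Bound the external call so a hung broker cannot keep the DB
    connection (and its open transaction) occupied forever."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(publish_to_queue, event)
    try:
        future.result(timeout=timeout)
        return "committed"
    except concurrent.futures.TimeoutError:
        # Fail fast and roll back instead of blocking the connection pool.
        return "rolled back: queue publish timed out"
    finally:
        executor.shutdown(wait=False)

print(handle_event({"id": 1}))  # -> rolled back: queue publish timed out
```

An even stronger design choice is to move such calls outside the transaction entirely, so a slow broker can never be coupled to database connection lifetime.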
The RabbitMQ cluster had been running without disruption for the last 1.5 years, and monitoring showed nothing suspicious. It did not seem a likely cause, since queues process messages independently and asynchronously (which is why they are used to offload tasks for later execution), but we decided to look into it more closely anyway. After analyzing the machines in the cluster and connecting to them directly, we saw that they were periodically shutting down and restarting, although this was not visible in the cluster monitoring on our Grafana dashboards, nor did we receive any alerts.
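The restarts were invisible because the dashboards tracked cluster-level health rather than individual nodes. A simple per-node check (a sketch; it assumes you can sample each broker's uptime, for example from its management or metrics endpoint) is to flag any sample where uptime went backwards:

```python
def detect_restarts(uptime_samples):
    """Return the indices where a node's uptime decreased, i.e. the node
    restarted between samples (hypothetical helper; the samples would
    come from the broker's management/metrics endpoint)."""
    restarts = []
    for i in range(1, len(uptime_samples)):
        if uptime_samples[i] < uptime_samples[i - 1]:
            restarts.append(i)
    return restarts

# Uptime in seconds, sampled once a minute: each dip marks a silent restart.
samples = [3600, 3660, 3720, 45, 105, 165, 20]
print(detect_restarts(samples))  # -> [3, 6]
```

Alerting on this kind of signal per node, rather than on aggregate cluster health, is one way to catch a flapping broker before it takes out dependent services.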
Because the message queue was unresponsive, API calls sat waiting for a queue connection; this blocked their database transactions, which kept connections in the database pool occupied, which in turn left other API requests waiting indefinitely for a free connection - a loop that caused a chain of failures. We immediately started addressing the situation by scaling the cluster up vertically, upgrading the machines it runs on with more processing power and networking capacity. After the upgrade, we added additional monitoring metrics to the cluster.
In parallel, we were investigating a probable root cause: intermittent networking issues on a Kubernetes node that were causing pods on that node to repeatedly open connections without closing them. Digging deeper, we concluded that an underlying networking issue was the most probable root cause after we observed and correlated several facts:
Once the broken machine was removed from the cluster, the system started operating normally.
To sum up: we consider the main cause of the incident to be a problem with an AWS EC2 instance, provisioned as part of our EKS (managed Kubernetes) cluster, that occurred during a normal release process. Network-related errors on that instance caused a number of connection issues in the RabbitMQ cluster, leading to a chain of failures.
As our investigation wraps up, we want to highlight our continuous-improvement mindset and provide clarity on what we are doing to improve our systems: