Public API is down

Incident Report for PandaDoc

Postmortem

A summary of what happened

At 14:01 PDT on Friday, April 7th, our monitoring indicated that our Public API request rate had dropped and health checks were failing. The situation deteriorated rapidly, and we noticed that some of our API endpoints had become unresponsive, which impacted the availability of the PandaDoc platform.

We followed our protocol and immediately started our incident response procedure, rolled back recent updates, and involved engineers in multiple investigation paths. After dismissing some initial theories, we concluded that the issue was at the infrastructure level and started investigating it together with our cloud provider (AWS).

After a deep investigation that lasted several hours, we were able to track down the issue to network problems: several pods on a specific Kubernetes node were experiencing intermittent low-level network issues that caused connection leaks (connections were repeatedly opened without being closed, or with only some of them being closed). This eventually led to increased latency and memory consumption and resulted in some of our core services entering a chain of crashes. As a consequence, the application and API were unavailable during the downtime.

Once the root cause was identified, the broken machine was removed from the cluster and the system started operating normally. The issue was fully resolved by 01:23 PDT, April 08.

A deep dive - how we investigated the root cause

When the incident started, we noticed a spike in the number of connections in our database pool and many API calls waiting for connections to be released so they could process incoming requests. We quickly figured out that what was stopping connections from being released was a large series of uncommitted transactions that were just sitting idle. We then started analyzing database locks and deadlocks, because these are the usual cause of this behavior, and wrote a hotfix for one of our API endpoints to reduce the number of processed events, expecting that this would release connections faster.

Soon after this, we understood that the database was not the bottleneck, although the number of stalled transactions kept growing and connections in the pool were still being taken and not released. We ran a deeper analysis of API endpoint metrics, which revealed that external calls within the transactions could be the culprit. After more investigation, we found a similarity among the API calls that were not responsive - they all interacted with our message queue (RabbitMQ HA cluster).

The RabbitMQ cluster had been working without any disruptions for the previous 1.5 years, and monitoring was not showing anything suspicious. It did not seem like a likely cause, since queues process messages independently in async mode (that’s why they are used to offload tasks to be executed asynchronously later), but we still decided to look into it more closely. After analyzing the machines in the cluster and connecting to them directly, we saw that they were shutting down and reloading periodically, although this was not visible in the cluster monitoring in our Grafana dashboards, nor did we get any alerts.

Because the message queue was unresponsive, API calls sat waiting for a connection to it. Those waiting calls kept their transactions open, the open transactions held on to connections from the database connection pool, and other API requests then waited indefinitely for a free connection, producing a chain of failures. We immediately started addressing the situation by scaling the cluster up vertically, upgrading the machines it runs on to ones with more processing power and networking capabilities. After the upgrade, we added additional monitoring metrics to the cluster.
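The chain described above can be reproduced in miniature. The following is a minimal sketch under stated assumptions, not our production code: `ConnectionPool`, `handle_request`, and `simulate` are hypothetical names, a semaphore stands in for the database connection pool, and an event that is not set in time stands in for the unresponsive message queue.

```python
import threading


class ConnectionPool:
    """Toy bounded pool; stands in for a database connection pool."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)

    def acquire(self, timeout):
        # Returns False if no connection became free within `timeout` seconds.
        return self._slots.acquire(timeout=timeout)

    def release(self):
        self._slots.release()


def handle_request(pool, publish, acquire_timeout=0.05):
    """Hypothetical API handler that publishes while holding a connection."""
    if not pool.acquire(acquire_timeout):
        return "no connection available"
    try:
        publish()  # external call made while the transaction is still open
        return "ok"
    finally:
        pool.release()


def simulate(pool_size, stuck_requests, broker_stall=5.0):
    """Start requests whose publish call stalls, then send one more request."""
    pool = ConnectionPool(pool_size)
    broker_recovered = threading.Event()
    holding = threading.Semaphore(0)  # counts requests that hold a connection

    def stalled_publish():
        holding.release()
        broker_recovered.wait(broker_stall)  # blocks like an unresponsive queue

    workers = [
        threading.Thread(target=handle_request, args=(pool, stalled_publish))
        for _ in range(stuck_requests)
    ]
    for worker in workers:
        worker.start()
    # Wait until the stalled requests have actually taken their connections.
    for _ in range(min(pool_size, stuck_requests)):
        holding.acquire(timeout=1.0)

    result = handle_request(pool, lambda: None)  # a request with a healthy publish

    broker_recovered.set()  # let the stalled requests finish
    for worker in workers:
        worker.join()
    return result
```

With a pool of two connections, `simulate(2, 2)` leaves the extra request without a connection even though its own publish call is healthy, while `simulate(2, 1)` lets it through. The extra request fails only because other requests are holding connections while they wait on the queue.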

In parallel, we were investigating a probable root cause: intermittent networking issues on a Kubernetes node that were causing pods on that node to repeatedly open connections without closing them. Digging deeper, we concluded that an underlying networking issue was the most probable root cause after observing and correlating several facts:

We had randomly missing metrics in our Prometheus monitoring for several systems, coinciding in time with the degradation of the RabbitMQ cluster metrics (the number of sockets started growing linearly).
We found that all pods on one particular Kubernetes node (which had been added to the cluster on Friday morning) were having trouble connecting to other parts of the system (our NATS cluster). We also noticed error patterns in logs related to closed network connections or client timeouts, in higher numbers than normal. At the same time, we observed that the number of slow NATS consumers had been growing abnormally since the start of the incident.
Most of the connections to the RabbitMQ nodes during the incident period were coming from pods residing on the faulty node.
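The last observation amounts to grouping broker connections by the node that hosts the client pod. A minimal sketch of that correlation follows, assuming two hypothetical inputs that are not taken from the incident data: a list with one client pod name per open connection, and a mapping from pod name to node name.

```python
from collections import Counter


def connection_share_by_node(connection_pods, pod_to_node):
    """Return the fraction of connections that originate from each node.

    connection_pods: one pod name per open connection (hypothetical input).
    pod_to_node: mapping of pod name to the node it runs on (hypothetical input).
    """
    counts = Counter(pod_to_node.get(pod, "unknown") for pod in connection_pods)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: count / total for node, count in counts.items()}


# Illustrative values only.
connections = ["api-1"] * 6 + ["api-2"] * 2 + ["worker-1"] * 2
placement = {"api-1": "node-a", "api-2": "node-b", "worker-1": "node-a"}
shares = connection_share_by_node(connections, placement)
```

A node whose share is far above its share of pods is a candidate for closer inspection.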

Once the broken machine was removed from the cluster, the system started operating normally.

To sum up: we consider the main cause of the incident to be a problem with an AWS EC2 instance provisioned as part of our EKS (managed k8s cluster) that occurred during the normal process of a release. Network-related errors caused a number of connection issues on the RabbitMQ cluster, leading to a chain of failures.

What we have done and will be doing next

As our investigation wraps up, we want to highlight our continuous improvement mindset, and to provide clarity on what we are doing to improve our systems:

We have improved the robustness and scale of our RabbitMQ cluster to reduce the likelihood of failure when the number of network connections grows, and reviewed the HA RabbitMQ setup and its replication settings
We’ve added additional logging and metrics to our RabbitMQ cluster, as well as early detection alarms for any deviation in network traffic patterns for the cluster
We’ve engaged AWS in the investigation and resolution of this outage. AWS support is running their own investigation into the issue
We’ll further improve our Observability stack by reviewing which additional metrics we can add to improve the detection of underlying problems in AWS-managed services (e.g. EKS), reducing alerting noise, and ensuring certain alerts are highlighted (RabbitMQ / failing pods)
As an additional step to prevent this in the future, we’re planning to review all the external calls in our API handlers and move them to a transactional outbox to avoid blocking transactions if external services become unavailable.
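To illustrate the transactional outbox idea from the last item, here is a minimal sketch using SQLite from the Python standard library. It is not our implementation: the `documents` and `outbox` tables, the function names, and the `publish` callback are all hypothetical. The handler records the event in the same local transaction as the business write, and a separate relay publishes it to the message queue later.

```python
import json
import sqlite3


def create_schema(db):
    db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def create_document(db, name):
    """Handler: writes the row and its event in one local transaction.

    No broker call happens here, so a slow or unreachable broker cannot
    keep this transaction (and its pooled connection) open.
    """
    with db:  # commits on success, rolls back on exception
        cur = db.execute("INSERT INTO documents (name) VALUES (?)", (name,))
        event = {"type": "document.created", "id": cur.lastrowid}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return cur.lastrowid


def relay_outbox(db, publish):
    """Relay: publishes pending events in order, marking each one on success."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            break  # broker unavailable: leave the rest pending and retry later
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

In this sketch a failed publish leaves the event pending for the next relay run, so the API request itself never waits on the queue. Delivery is at-least-once: an event can be published twice if the relay stops between publishing and marking the row, so consumers would need to tolerate duplicates.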

Posted Apr 17, 2023 - 09:00 PDT

Resolved

We're all set! If you continue to experience any issues with this, please reach out to us at support@pandadoc.com. Thank you so much for your patience and understanding!

Posted Apr 08, 2023 - 02:23 PDT

Monitoring

We have resolved the issue and PandaDoc should now be up and running! We're still monitoring the application performance and will post a final “ALL SET” message once we’ve confirmed the fix produced consistent output.

Posted Apr 08, 2023 - 01:23 PDT

Identified

We’ve identified the root of the issue and are already working on a fix. We appreciate your patience and will post updates here as soon as possible. Stay tuned!

Posted Apr 08, 2023 - 01:17 PDT

Update

Unfortunately, the services are still being impacted by the outage. Please rest assured that technicians and engineers are digging into it in order to provide the fastest fix possible. Please check back here for updates. We appreciate your patience while we get this resolved.

Posted Apr 08, 2023 - 00:57 PDT

Update

Sadly, the website is still experiencing technical difficulties. The team continues to put all possible effort into the investigation and resolution. Thank you so much for your patience and understanding.

Posted Apr 07, 2023 - 21:00 PDT

Update

The development and engineering team is doing their best to resolve the issue. Please check back here for updates, and we appreciate your patience while we get this resolved.

Posted Apr 07, 2023 - 18:03 PDT

Update

We’re really sorry for holding you up! Please know our engineering and operations teams are working hard to get everything up and running.

Posted Apr 07, 2023 - 16:37 PDT

Update

We’re on it! Our team is doing its best to get you back on track as soon as possible. Please check back here for updates.

Posted Apr 07, 2023 - 15:40 PDT

Update

Thank you for your patience! Our development team is already hard at work solving this issue and determining next steps. Check back here for updates and we will get this back on track as soon as possible.

Posted Apr 07, 2023 - 14:57 PDT

Update

We are continuing the investigation and doing our best to get to a resolution as fast as possible. Please check back here for updates, and we appreciate your patience while we get this resolved.

Posted Apr 07, 2023 - 14:27 PDT

Investigating

We are actively investigating the outage of the public API. Please check back here for updates, and we appreciate your patience while we get this resolved.

Posted Apr 07, 2023 - 14:01 PDT

This incident affected: US & Global (Creating and editing documents, Sending and opening documents, Uploading and downloading documents, Public (recipient) view, Signup, CRMs & Integrations, API, Webhooks, Web application, Mobile application, Website).