Application Outage
Incident Report for Close
Postmortem

Over the weekend, on Saturday September 19th, we upgraded some of our database servers from MongoDB 2.6.11 to MongoDB 3.0.6. We did this following recommendations from MongoDB's official release notes and documentation, as well as from the consultant we engaged, with the goal of better performance and stability. Unfortunately, the upgrade failed at that quite spectacularly, and we did a poor job of stress testing the upgraded version before sending production traffic to it.

Being a B2B solution, our traffic over the weekend is much lower than on any given weekday. After the upgrade, CPU usage on our database boxes was higher than normal, but still well within acceptable bounds, and overall the databases seemed to perform about the same as before the upgrade. Unfortunately, starting at the beginning of Monday September 21, CPU usage kept growing until it brought Close.io to a stop at about 8am PT. The load average on our database boxes reached many times the maximum load we had previously seen with the older MongoDB version - something we did not account for. Something was very wrong with the database.

We decided to shut down all email syncing, indexing, and the workers performing background tasks, and to block requests coming in via the API, in an attempt to make the application usable and restore full functionality more quickly. However, the Close.io application still remained slow. Next, we tried to quickly spin up new database instances with more CPU and memory capacity, but hit a bottleneck in the network-attached storage from which we had to copy a backup of our data. In parallel, we worked on downgrading back to the older MongoDB version. Finally, at 10am PT we completed the downgrade and things quickly stabilized.

What we could've done to prevent it

Our stress testing of the upgraded MongoDB version did not sufficiently replicate a high Monday-morning load. Our contingency plans (what we would do quickly if we saw a problem) were also inadequate.
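
To make that concrete, below is a minimal sketch of the kind of load replay we should have run against a staging replica set on the candidate MongoDB version before the cutover. The connection string, the "leads" collection, the concurrency level, and the query shapes are illustrative placeholders, not our actual workload:

    # Illustrative load-replay sketch, not our production harness.
    # Assumes a staging replica set running the candidate MongoDB version and
    # a hypothetical "leads" collection; query shapes and ratios are made up.
    import random
    import threading
    import time

    from pymongo import MongoClient

    STAGING_URI = "mongodb://staging-db:27017/"  # hypothetical host
    CONCURRENCY = 50          # rough stand-in for Monday-morning worker/API concurrency
    DURATION_SECONDS = 600    # sustain the load long enough to see CPU trends

    collection = MongoClient(STAGING_URI).crm.leads
    latencies = []
    lock = threading.Lock()

    def worker():
        deadline = time.time() + DURATION_SECONDS
        while time.time() < deadline:
            start = time.time()
            # Mix of reads and writes, roughly mirroring a production ratio.
            if random.random() < 0.8:
                list(collection.find({"organization_id": random.randint(1, 1000)})
                     .sort("date_updated", -1).limit(50))
            else:
                collection.update_one(
                    {"organization_id": random.randint(1, 1000)},
                    {"$set": {"date_updated": time.time()}},
                )
            with lock:
                latencies.append(time.time() - start)

    threads = [threading.Thread(target=worker) for _ in range(CONCURRENCY)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    latencies.sort()
    print("ops: %d, p95 latency: %.3fs"
          % (len(latencies), latencies[int(len(latencies) * 0.95)]))

Watching the database boxes' CPU and load average while a replay like this runs at weekday-level concurrency is exactly the signal our weekend testing missed.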

What we're doing next

We have aggregated all of the logs and monitoring data and opened a ticket with the MongoDB consultants to get to the bottom of the CPU issues. More importantly, we'll never do another database upgrade without more complete benchmarking under heavy load, and we'll have a more thorough contingency plan in place so that we can recover faster if an incident does occur.
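
As part of that benchmarking, we want before/after numbers from the database itself, not just host-level CPU graphs. The sketch below is one simple way to get them: it periodically snapshots db.serverStatus() on a staging host while a load test runs, so the same metrics can be compared across the old and new MongoDB versions. The host name and output file are placeholders:

    # Illustrative metric-snapshot sketch for comparing MongoDB versions under
    # the same benchmark load. Field names come from db.serverStatus(); the
    # host and output file are placeholders.
    import json
    import time

    from pymongo import MongoClient

    client = MongoClient("mongodb://staging-db:27017/")  # hypothetical host

    def snapshot():
        status = client.admin.command("serverStatus")
        return {
            "time": time.time(),
            "version": status["version"],
            "connections": status["connections"]["current"],
            "opcounters": status["opcounters"],  # cumulative query/insert/update/delete counts
            "global_lock_queue": status["globalLock"]["currentQueue"],
        }

    # Snapshot every 10 seconds for the duration of the benchmark; diffing
    # opcounters between snapshots gives throughput to correlate with host CPU.
    with open("serverstatus.log", "a") as log:
        for _ in range(60):
            log.write(json.dumps(snapshot()) + "\n")
            time.sleep(10)

Running the same load and the same snapshots against both the old and new versions gives a like-for-like comparison before any production cutover.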

Posted Sep 25, 2015 - 08:56 PDT

Resolved
Everything is back to normal. We apologize for the inconvenience. A postmortem will be published soon.
Posted Sep 21, 2015 - 10:49 PDT
Monitoring
We have brought up additional servers and stabilized the database performance. We're slowly re-enabling our workers and indexers.
Posted Sep 21, 2015 - 10:06 PDT
Update
In an effort to get things running more quickly, we're currently disabling API requests and have paused our workers, including updates to search and email syncing.
Posted Sep 21, 2015 - 08:54 PDT
Identified
We've identified the cause and are bringing up additional server resources to deal with the issue.
Posted Sep 21, 2015 - 08:24 PDT
Investigating
We're investigating a partial outage. You may experience delays in email syncing, report updates, smartviews / search results, and bulk actions.

The engineering team is currently looking into resolving this as quickly as possible.
Posted Sep 21, 2015 - 07:57 PDT