Over the weekend, on Saturday September 19th, we upgraded some of our database servers from MongoDB 2.6.11 to MongoDB 3.0.6. We did this following recommendations from MongoDB's official release notes, documentation, and the consultant we engaged, with the goal of better performance and stability. Unfortunately, the upgrade failed at that quite spectacularly, and we did a bad job of stress testing the upgraded version before sending production traffic to it.
Being a B2B solution, our traffic over the weekend is much smaller than on any given weekday. After the upgrade, the CPU usage on our database boxes was higher than normal, but still well within acceptable bounds, and overall the system seemed to be performing similarly to the pre-upgrade situation. Unfortunately, starting at the beginning of Monday September 21st, CPU usage kept growing until it brought Close.io to a stop at about 8am PT. The load average on our database boxes became many times higher than the maximum load we'd previously seen with the older MongoDB version, something we did not account for. Something was very wrong with the database.
We decided to shut down all the email syncing, indexing, and the workers performing background tasks, as well as block the requests coming via the API in an attempt to make the application usable and restore full functionality more quickly. However, the Close.io application still remained slow. Next, we tried quickly spinning up new database instances with more CPU and memory capacity, but we found a bottleneck in the network-attached storage from which we had to copy a backup of our data. In parallel, we also worked on downgrading back to the older MongoDB version. Finally, at 10am PT we completed a downgrade of MongoDB to the old version and things quickly stabilized.
What we could've done to prevent it
Our stress testing of the upgraded MongoDB version was insufficient to replicate a heavy Monday morning load. Our contingency plans (what we'd do quickly if we saw a problem) were also inadequate.
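To make the load-testing lesson concrete, here's a minimal sketch of the kind of benchmark that would have caught this: replaying the same workload at weekday-peak concurrency rather than weekend concurrency. This is not our actual benchmarking setup; `run_query` is a hypothetical stand-in for real database calls, and the concurrency numbers are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_query(query):
    # Hypothetical stand-in for a real database call (e.g. a pymongo query).
    time.sleep(0.001)
    return query

def replay(queries, concurrency):
    """Replay a batch of queries at the given concurrency; return per-query latencies."""
    def timed(q):
        t0 = time.monotonic()
        run_query(q)
        return time.monotonic() - t0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, queries))

# Same workload, weekend-level vs. Monday-morning-level concurrency.
weekend = replay(range(200), concurrency=4)
monday = replay(range(200), concurrency=64)
```

Comparing latency distributions (not just averages) between the two runs, against both database versions, is what would have surfaced the CPU regression before production traffic did.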
What we're doing next
We have aggregated all the logs and the monitoring data, and opened a ticket with the MongoDB consultants to get to the bottom of the CPU issues. More importantly, we'll never do another database upgrade without more complete benchmarking under heavy load, and we'll have a more thorough contingency plan in place so that we can fix the problem faster if an incident does occur.