On August 5, Close.io began experiencing moderate performance slowdowns at 12:50am PDT, which became significantly worse around 1:12am PDT and lasted until 3:52am PDT.
Between 5:30am and 8am PDT, another batch of moderate performance slowdowns occurred, and for about 20 minutes during this window the slowdowns were severe enough to make Close.io nearly unusable.
In total, for about 3 hours, Close.io was nearly unusable, and for another 2.5 hours its performance was significantly degraded.
We are extremely sorry for these issues and consider them unacceptable. We know that many teams rely on Close.io being stable and fast, and we let you down big time today.
We're working hard to make sure events like today's don't continue.
We identified the root technical issue behind the performance degradations. Our MongoDB cluster started using an inferior query plan for many queries, preferring a worse index over a more optimal one. MongoDB periodically reevaluates its query plans to choose the best index for each query type, and this normally results in the most efficient way to execute queries. However, in rare instances, due to what we believe is a MongoDB bug, it chooses an incorrect index for a specific query type. The problem eventually corrects itself when the plan is later reevaluated and the normal/optimal index is chosen again, but in the meantime it can cause significant performance degradation.
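To illustrate how this kind of misstep can happen, here is a deliberately simplified sketch (not MongoDB's actual implementation): the planner trials candidate plans against a small sample of work and keeps the cheapest one, so an unrepresentative trial can crown the wrong index until the next reevaluation. The collection, index names, and cost models below are invented for illustration:

```python
def rank(plans, sample):
    """Return the plan that does the least work on a trial sample
    (work ~ index entries examined during the trial period)."""
    return min(plans, key=lambda p: p["work"](sample))

# Hypothetical work models for two indexes serving the same query:
# "find this org's tasks, newest first". Documents in the sample are
# assumed to arrive in newest-first index order.
by_org_date = {
    "name": "org_1_date_-1",  # selective compound index: optimal overall
    "work": lambda s: sum(1 for d in s if d["org"] == "acme"),
}
by_date = {
    "name": "date_-1",  # walks newest docs, filtering by org as it goes
    "work": lambda s: next(i + 1 for i, d in enumerate(s)
                           if d["org"] == "acme"),
}

# Representative trial: matching docs are rare, so the compound index wins.
representative = [{"org": "other"}] * 9 + [{"org": "acme"}]
print(rank([by_org_date, by_date], representative)["name"])  # org_1_date_-1

# Skewed trial: the newest docs happen to match, so the worse index looks
# cheaper during the trial and is kept for subsequent queries.
skewed = [{"org": "acme"}] * 5 + [{"org": "other"}] * 5
print(rank([by_org_date, by_date], skewed)["name"])  # date_-1
```

The same query, ranked against two different trial samples, yields two different "winning" indexes; a cached wrong winner stays in effect until the next reevaluation, which matches the recovery behavior we observed.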
We've had similar issues in the last few weeks, but until now they only led to far less significant problems with much faster recovery. Still, we had started executing on a plan (details below) to better understand and avoid the issue. We took several steps in this direction, but unfortunately we didn't act fast enough. This morning, the incorrect index was chosen on a much larger and more frequently accessed collection, and the problem took significantly longer to resolve itself through reevaluation.
Prior to today's incident, we engaged an official MongoDB consultant to better understand this issue and how it could be avoided. Their team has acknowledged our occasional poor-query-plan behavior as a problem and is already investigating it in more detail. We expect to learn more within a couple of days.
There are two other main steps for working around this issue:
Once our team discovered the root issue this morning, we quickly tried hard-coding index hints on the affected queries to restore the site's performance. While our quick fix worked in most cases, it was an incomplete solution that caused more harm than good due to the edge cases it handled incorrectly. This is what led to the second batch of issues starting at 5:30am. Ultimately we reverted many of our quick-fix hints and allowed MongoDB to make its own index selection, which almost always chooses the best option.
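A hint forces MongoDB to use a specific index instead of its automatic plan selection. Below is a minimal sketch of the kind of quick fix described above, using pymongo's `Cursor.hint()` API; the query name, index, and `HINTS` registry are hypothetical examples, not our actual code:

```python
# Hypothetical registry mapping query shapes to known-good indexes.
# pymongo's Cursor.hint() accepts a list of (field, direction) pairs.
HINTS = {
    "org_tasks_newest_first": [("organization_id", 1), ("date_created", -1)],
}

def find_with_hint(collection, query, query_name):
    """Run find(), forcing a known-good index when one is registered;
    otherwise let MongoDB's planner pick the index itself."""
    cursor = collection.find(query)
    hint = HINTS.get(query_name)
    if hint is not None:
        cursor = cursor.hint(hint)  # bypasses automatic plan selection
    return cursor
```

The danger, as this morning showed, is that a hard-coded hint applies to every variation of a query, including edge cases where the forced index is the wrong choice, which is why we reverted many of them.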
Going forward, we expect to overcome this issue by completing our engagement with professional MongoDB consultants, implementing well-tested hard-coded index "hints" where appropriate, and upgrading to the latest stable version of MongoDB.
A member of our European support team discovered the issues around 2:53am PDT, but our US-based engineering team wasn't awakened to investigate until 3:35am PDT. This means it took 2 hours of degraded performance before anyone on our team noticed, and almost 3 hours before our engineering/ops team started working on a fix or communicating the incident on our status page.
While we take full responsibility for the database issues, problems like these will never be completely avoidable. Our slow response time, however, is a very avoidable problem, and we feel especially terrible about it. We're now working hard to make sure our incident response time is never this bad again.
How our slow response time happened and what we're doing to fix it:
We're very sorry for today's issues and how we disrupted your work day. You rely on Close.io to make you better, and this morning we didn't fulfill that mission.
Please let us know if you have any additional questions or concerns.
– The Close.io Team