Performance degradation
Incident Report for Close
Postmortem

On August 5, Close.io began experiencing moderate performance slowdowns at 12:50am PDT, which became significantly worse around 1:12am PDT and lasted until 3:52am PDT.

Between 5:30am and 8am PDT another batch of moderate performance slowdowns occurred, and for about 20 minutes during this window they were severe enough to make Close.io nearly unusable.

In total, for about 3 hours, Close.io was nearly unusable, and for another 2.5 hours its performance was significantly degraded.

We are extremely sorry for these issues and consider them unacceptable. We know that many teams rely on Close.io being stable and fast, and we let you down big time today.

We're working hard to make sure events like today's don't happen again.

What happened

We identified the root technical issue behind the performance degradation. Our MongoDB cluster started using an inferior query plan for many queries, preferring a worse index over a more optimal one. MongoDB periodically reevaluates the query plan for each query shape to choose the best index, and this normally results in the most efficient way to execute queries. On rare occasions, however, due to what we believe is a MongoDB bug, it picks the wrong index for a specific query type. The problem eventually corrects itself when the plan is reevaluated again and the optimal index is chosen, but in the meantime it can cause serious performance degradation.
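For illustration only, here is roughly how this kind of index selection can be checked from application code using pymongo's explain(). This is a minimal sketch, not our actual code; the connection string, collection, and field names below are hypothetical:

    # A minimal sketch (hypothetical collection and field names) of checking
    # which index MongoDB's query planner picked for a query, via explain().
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["app"]

    # Ask the planner how it would execute this query instead of running it.
    plan = db.leads.find({"organization_id": "org_123", "status": "active"}).explain()

    # On recent MongoDB versions the chosen plan appears under
    # queryPlanner.winningPlan; an IXSCAN stage there names the selected index.
    print(plan.get("queryPlanner", plan))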

We've seen similar issues in the last few weeks, but until now they had only led to far less significant problems with much faster recovery. Still, we started executing on a plan (details below) to better understand the issue and avoid it. We took several steps in this direction, but unfortunately we didn't act fast enough. This morning the incorrect index was chosen on a much larger and more frequently accessed collection, and the problem took significantly longer to resolve itself through reevaluation.

Prior to today's incident, we had engaged an official MongoDB consultant to better understand this issue and how it could be avoided. Their team has acknowledged our occasional poor-query-plan selection as a problem and is already investigating it in more detail. We expect to learn more within a couple of days.

There are two other main steps for working around this issue:

  • Hard-coding index "hints" into our database queries, so MongoDB can never guess the wrong index. This shouldn't generally be necessary, but it is a preventative measure that we recently started implementing throughout parts of our codebase (see the sketch after this list). Unfortunately we hadn't yet covered the specific queries that caused today's problems.
  • We believe a more recent version of MongoDB may fix or mitigate the issue. We have an upgrade planned soon to take advantage of any recent bug fixes.
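To make the first item concrete, here is a minimal sketch of what a hard-coded hint looks like with pymongo. The collection, field, and index names are hypothetical, not our actual schema:

    # A minimal sketch (hypothetical names) of pinning a query to a specific
    # index with pymongo's hint(), so the planner cannot fall back to a
    # less suitable index.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["app"]

    # Without hint(), MongoDB picks the index itself; with hint(), this query
    # is forced onto the compound index on (organization_id, date_created),
    # assuming such an index exists. A hint that doesn't fit every query shape
    # it covers can make things worse rather than better.
    cursor = (
        db.leads.find({"organization_id": "org_123"})
        .sort("date_created", -1)
        .hint([("organization_id", 1), ("date_created", -1)])
        .limit(10)
    )

    for doc in cursor:
        print(doc["_id"])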

Once our team discovered the root issue this morning, we quickly tried hard-coding index hints on the affected queries to restore the site's performance. While this quick fix did work for most cases, it was an incomplete solution that caused more harm than good because of the edge cases it handled incorrectly. This is what led to the second batch of issues starting at 5:30am. Ultimately we reverted many of our quick-fix hints and let MongoDB make its own index selection, which almost always chooses the best option.

Going forward we expect to overcome this issue by finishing our engagement with professional MongoDB consultants, implementing well-tested hard-coded "hints" where appropriate, and upgrading to the latest stable version of MongoDB.

What we really screwed up

A member of our European support team discovered the issues around 2:53am PDT. Our US-based engineering team didn't wake up to investigate until 3:35am PDT. This means it took 2 hours of degraded performance before anyone on our team noticed, and almost 3 hours before our engineering/ops team started working on a fix or communicated the incident on our status page.

While we take full responsibility for the database issues, at some level problems like these will never be completely avoidable. Our slow response time, however, is a very avoidable problem, and we feel especially terrible about it. We're now working hard to make sure our incident response time is never this bad again.

How our slow response time happened and what we're doing to fix it:

  • About a month ago we launched European support hours, and we plan to extend those hours soon so they start earlier in the European day. We have been primarily a US-based company thus far, but since we have customers in every timezone who equally deserve our attention, we're working hard to provide better support during non-US business hours.
  • We didn't properly train new hires on our support team on how to contact the engineering/ops team when an issue is discovered while the engineers are offline. We now have a written procedure in place so that once anyone on our team knows about a serious issue, they can reach the right teammates at any hour.
  • Our engineering/ops team is on call through PagerDuty to ensure that we're alerted to any emergencies and can be woken up if needed.
  • In the past couple of days we started receiving a lot of non-critical alerts via PagerDuty. We were being alerted to relatively minor issues with a specific component in the same way we'd be alerted to critical performance or downtime events like this morning's. This led to alert fatigue: although our team was woken by the alerts in the middle of the night, the important one was mixed in with the more frequent non-critical alerts and was ignored as a result. Obviously this is a very bad thing, and we're adjusting our alerts and policies so that we can take every alert seriously.
  • Our response time also suffered because our engineering/ops team is small and one member was on vacation. We're working hard to recruit a few really great people for our engineering team, and we're hiring for several roles, especially a dev-ops role. Please get in touch if you're interested.

We're very sorry for today's issues and how we disrupted your work day. You rely on Close.io to make you better, and this morning we didn't fulfill that mission.

Please let us know if you have any additional questions or concerns.

– The Close.io Team

Posted Aug 06, 2015 - 05:03 PDT

Resolved
This incident has been resolved.
Posted Aug 05, 2015 - 05:10 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 05, 2015 - 04:39 PDT
Investigating
We are currently investigating this issue.
Posted Aug 05, 2015 - 03:38 PDT