Search Outage
Incident Report for Close
Postmortem

First and foremost: an apology to our customers

We at Close.io understand that our system is an integral part of many people's workflow. We know our customers rely on us for their sales process and more, and nothing's worse than not closing deals for reasons completely out of your hands.

We always want to be better, and the first step in being better is owning up to our mistakes: We are sincerely sorry we did not live up to the level of stability that we should have. We apologize for what we know was a major workflow disruption for many teams.

This post-mortem should serve not as a justification, but as an explanation of how things went the way they did. It should give our customers clarity into what happened and help us grow and learn from our mistakes.

A word of caution: in an effort to be as transparent as possible, the following sections are (perhaps a bit too) technical in nature. The summary of what happened is, basically:

  • During a migration we were overzealous in shrinking our search cluster
  • We lacked deep monitoring of the services in that search cluster
  • The issue was slow-creeping, so it wasn't immediately visible after our migration; read the next section if you want to know more
  • We could have responded better while the failure was ongoing; read the Timeline of Events section if you want to know more
  • We have already identified what went wrong and taken some initial steps to fix it; read the Root Cause Analysis section if you want to know more
  • There are still a few action items we're working on so this doesn't happen again; read the Action Items section if you want to know more

The next sections explain the incident (and the background that led to it) in as much detail as we can without being overly verbose.

A bit of background

We recently moved our entire elasticsearch infrastructure from a cluster of servers in Amazon EC2 to a cluster of servers in Amazon VPC. The move also meant that the configuration of the servers changed: we were now using fewer servers, but each of them had significantly more CPU power behind it. The end result was that we halved the number of servers but actually doubled the number of CPU cores used by our elasticsearch cluster.

Another recent change was to make our dedicated elasticsearch masters smaller. These masters are important but the actual work they perform is very lightweight.

The move to the new servers had been under way for a couple of weeks already, but late on Monday 2/15 the new, smaller masters inside the VPC took over. This was a week before the incident on 2/23, and we noticed no issues during that week.

What components were impacted

Our elasticsearch cluster powers much of the search and reporting functionality in our application, most notably:

  • Smart views / Leads search
  • Inbox
  • Reporting
  • Opportunities list page

API requests making use of any of the above features were also impacted.

Timeline of events

This is the reconstructed timeline of significant events on Tuesday 2/23 (all timestamps are in US Pacific time):

  • 06:17 - New Relic alerts on the number of errors being reported, a PagerDuty alert is fired, and a StatusPage incident is created
  • 06:20 - Two engineers start looking at the problem
  • 06:51 - We decide to restart the webservers in an attempt to free up connections
  • 07:28 - We identify the failure: masters are running out of memory
  • 07:35 - We loop in the rest of the engineering team
  • 08:00 - The cluster is restarted and recovery is taking place; the cluster is in read-only mode
  • 09:55 - The cluster is green and writes are enabled
  • 10:44 - We see the queues growing again; we stop most of the intra-cluster traffic to speed up recovery
  • 11:30 - The queues are back under the critical threshold
  • 12:00 - New masters with more resources are added to the cluster
  • 21:00 - Maintenance starts for master failover
  • 21:35 - Failover is completed; we monitor the state of the cluster
  • 22:00 - Maintenance is completed

Root Cause Analysis

We have narrowed down the root cause to a few things:

  • The smaller JVM heap configured on the masters: we went from 7 GB of heap to 4 GB of heap
  • The JVM heap was configured to 4 GB because we reduced the instance size of the masters
  • We had reduced the heap size before actually reducing the masters' instance size, but the issue is a slow-creeping one: the cluster didn't thrash itself until 8 days later
  • We didn't notice the cluster's bad state because of a lack of deep alerting on the JVM side. The breaking point came 8 days after the heap was shrunk, but there were signs that things were not right a few days in advance
  • The members of our team who responded first to the emergency didn't loop in the rest of the engineering team until much later
  • Once the entire engineering team was looped in, we were slow to do what had to be done: we kept trying to salvage a lost situation instead of doing a full cluster restart and recovering from it as quickly as possible

All of our servers have regular OS-level monitoring, but in this case we didn't notice that heap usage was consistently above 75% of the configured heap capacity, which had gone from 7 GB to 4 GB.

Our elasticsearch Java settings are such that 75% heap usage is the point at which the Java virtual machine fires garbage collection to reclaim space. Before the masters were shrunk we were consistently below the 75% mark, but after a few days on the new, smaller masters, usage crossed the threshold and stayed there without us noticing. Put another way: with 75% of 4 GB (roughly 3 GB) of headroom instead of 75% of 7 GB (roughly 5.25 GB), the new masters were constantly firing garbage collection calls, but these would never reclaim enough memory to get back under the limit where garbage collection stops being triggered.
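
To illustrate both the failure mode and the kind of alerting we were missing, here is a minimal sketch (not our actual monitoring code; the URL and threshold are placeholders) that reads the per-node heap_used_percent value from elasticsearch's nodes stats API and flags any node sitting above the garbage collection trigger point:

    # Sketch: warn when a node's JVM heap sits above the GC trigger point.
    # The URL and threshold are placeholders, not our production values.
    import requests

    STATS_URL = "http://localhost:9200/_nodes/stats/jvm"
    HEAP_WARN_PERCENT = 75  # elasticsearch fires old-gen GC around this mark

    stats = requests.get(STATS_URL, timeout=10).json()
    for node_id, node in stats["nodes"].items():
        heap_used = node["jvm"]["mem"]["heap_used_percent"]
        if heap_used >= HEAP_WARN_PERCENT:
            print("WARNING: %s heap at %d%% of its configured maximum"
                  % (node.get("name", node_id), heap_used))

A check along these lines, run regularly and alerting on sustained (rather than momentary) breaches, would have surfaced the problem days before the cluster fell over.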

Other shops run with much smaller clusters, but due to the nature of our data and how our index is built, the mappings in our indexes are quite large. This in turn means the masters need a larger heap to function properly.
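
As a rough way to see this (a sketch with a placeholder URL, not a precise measurement), the mappings live in the cluster state metadata that every master holds in heap, and the serialized size of that metadata can be checked directly:

    # Sketch: estimate how much cluster state metadata (which includes the
    # index mappings) the masters have to carry. The URL is a placeholder.
    import json
    import requests

    STATE_URL = "http://localhost:9200/_cluster/state/metadata"

    metadata = requests.get(STATE_URL, timeout=30).json()
    size_mb = len(json.dumps(metadata)) / (1024.0 * 1024.0)
    print("serialized cluster metadata: %.1f MB" % size_mb)

The JSON size is only a proxy for the in-heap representation, but it gives a sense of how much heavier our masters' workload is compared to a cluster with small, simple mappings.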

Action Items

We have a few action items in our pipeline, some of which have already been completed and some of which we will complete in the near future:

  • Rollback our masters' size and configured heap size (already done)
  • Improve monitoring for JVM-level metrics
  • Improve the resiliency of our cluster in case of a failure scenario (already done)
  • Make sure the entire engineering team is contacted immediately during any major outage, even if a couple engineers are already working on the problem.
  • Upgrade our elasticsearch version for better stability
  • Expand our engineering team, specifically with an operations focus

Conclusion

At Close.io we strive to be better every single day, working toward an infrastructure and a team that are stable yet elastic enough to respond to any kind of situation. That balance is of course delicate, but we will get there, and we will keep working to give our customers the best product on the market.

Posted Mar 02, 2016 - 11:45 PST

Resolved
Post-mortem to follow in a few days.
Posted Feb 23, 2016 - 12:49 PST
Monitoring
Let's try this again...

All search systems are back online and results should be up-to-date. We'll be monitoring all systems closely for the next hour to make sure everything is holding up.
Posted Feb 23, 2016 - 11:40 PST
Identified
Seeing a slowdown in index updates due to timeouts on our search masters. We're investigating the cause of these new timeouts.

Until the index queues are back to normal you will experience inaccurate search results and Inbox items may be out of date.
Posted Feb 23, 2016 - 11:06 PST
Monitoring
All search systems are back online and results should be up-to-date. We'll be monitoring all systems closely for the next hour to make sure everything is holding up.

Watch out for a post-mortem in the coming days outlining what happened and how we can prevent future issues like this from impacting your sales day.
Posted Feb 23, 2016 - 10:37 PST
Update
Indexing queues are being worked through and we should be up-to-date in 15 minutes or so.
Posted Feb 23, 2016 - 10:09 PST
Update
Email sending is fully online now. We're now processing updates to our search index for all the changes that were missed in the last few hours. Search querying should be operational, but you'll still see old data/attributes being used when performing searches against your leads.
Posted Feb 23, 2016 - 09:11 PST
Update
We're in the process of bringing email sending back online and are working through the queue of messages, which we estimate should be sent over the next 10-15 minutes.
Posted Feb 23, 2016 - 09:05 PST
Update
We understand how important Close.io's uptime and stability are to your sales team, and we have the entire team working as quickly as possible to bring the search cluster back to life. At this point we don't have a solid ETA, but we hope it will be resolved within the next hour.
Posted Feb 23, 2016 - 08:15 PST
Identified
We're adding more search routing servers to increase the capacity.
Posted Feb 23, 2016 - 07:18 PST
Update
Our search cluster is experiencing a spike in the number of connections and cannot respond to all requests in time. We're investigating the root cause behind the connection spike.
Posted Feb 23, 2016 - 06:49 PST
Investigating
We are currently investigating this issue.
Posted Feb 23, 2016 - 06:16 PST