We at Close.io understand that our system is an integral part of many people's workflow. We know our customers rely on us for their sales process and more, and nothing's worse than not closing deals for reasons completely out of your hands.
We always want to be better, and the first step in being better is owning up to our mistakes: We are sincerely sorry we did not live up to the level of stability that we should have. We apologize for what we know was a major workflow disruption for many teams.
This post-mortem should serve not as justification, but as an explanation of how things went the way they did. It should add clarity for our customers and help us grow and learn from our mistakes.
Word of caution: in an effort to be as transparent as possible, the following sections are (perhaps a bit too) technical in nature. The summary of what happened is, basically:
The next sections explain the incident (and the background that led to it) in as much detail as we can without being overly verbose.
We recently moved our entire elasticsearch infrastructure from a cluster of servers in Amazon EC2 to a cluster of servers in Amazon VPC. The move also meant that the configuration of the servers changed: we were now using fewer servers, but each of them had significantly more CPU power behind it. The end result was that we halved the number of servers while doubling the total number of CPU cores used by our elasticsearch cluster.
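To make that concrete with illustrative numbers (these are not our actual instance counts): going from, say, 8 servers with 4 cores each (32 cores in total) to 4 servers with 16 cores each (64 cores in total) halves the server count while doubling the total core count.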
Another recent change was to make our dedicated elasticsearch masters smaller. These masters are important (they coordinate the cluster and hold its state), but the actual work they perform is very lightweight.
The move to the new servers had been underway for a couple of weeks already, but late on Monday 2/15 the new, smaller masters inside the VPC took over. This was a week before the incident on 2/23, and we noticed no issues during that week.
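For the curious, a quick way to see which node is currently acting as the elected master is elasticsearch's cat API. Below is a minimal sketch of such a spot check in Python; the URL is a placeholder, not one of our actual nodes:

```python
# Minimal sketch: ask the cluster which node is currently the elected master.
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder, not one of our actual nodes

def elected_master(es_url=ES_URL):
    """Return the /_cat/master output: node id, host, ip and node name."""
    with urllib.request.urlopen(es_url + "/_cat/master") as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    print("elected master:", elected_master())
```

A check like this tells you which node holds the master role after a planned handoff or a failover.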
Our elasticsearch cluster powers a lot of the search/reporting capabilities in our application, but basically:
API requests making use of any of the above features were also impacted.
This is the reconstructed timeline of significant events on Tuesday 2/23 (all timestamps are in US Pacific time):
We have narrowed down the root cause to a few things:
All of our servers have regular OS-level monitoring, but in this case we didn't notice that heap usage was consistently above 75% of the configured heap capacity. On the new masters we went from 7GB of configured heap down to 4GB.
Our elasticsearch Java settings make 75% the mark at which the Java virtual machine fires garbage collection calls to reclaim space. Before the masters got shrunk we would consistently stay below the 75% mark, but after a few days on the new, smaller masters we didn't notice that usage had crossed the threshold and stayed there. Put another way: the new masters were busy firing garbage collection calls, but these never reclaimed enough memory to get back under the limit at which garbage collection is no longer triggered.
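To put numbers on it: 75% of a 4GB heap is about 3GB, versus about 5.25GB on the old 7GB masters, so the new masters had far less headroom before the collector started running continuously. Below is a minimal sketch of the kind of check that would have caught this (illustrative Python, with a placeholder URL and a threshold matching the 75% mark described above; this is not our actual monitoring code):

```python
# Minimal sketch: poll each node's JVM heap usage via the nodes stats API and
# flag any node sitting above the 75% mark at which GC kicks in.
import json
import urllib.request

ES_URL = "http://localhost:9200"   # placeholder, not one of our actual nodes
HEAP_ALERT_THRESHOLD = 75          # the 75% mark described above

def heap_usage_by_node(es_url=ES_URL):
    """Return {node name: heap_used_percent} for every node in the cluster."""
    with urllib.request.urlopen(es_url + "/_nodes/stats/jvm") as resp:
        stats = json.load(resp)
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in stats["nodes"].values()
    }

if __name__ == "__main__":
    for name, heap_pct in sorted(heap_usage_by_node().items()):
        status = "over threshold" if heap_pct >= HEAP_ALERT_THRESHOLD else "ok"
        print(f"{name}: heap {heap_pct}% ({status})")
```

A single reading above 75% is normal (that's exactly when GC is supposed to run); the signal worth alerting on is heap that stays above the threshold across consecutive checks.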
Other shops run with much smaller clusters, but due to the nature of our data and how our index is built, the mappings in our indexes are pretty big. This in turn means our masters need a correspondingly bigger heap to function properly.
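One rough way to get a feel for this (a hedged sketch, not how we actually size our nodes; the URL is a placeholder) is to pull the mappings for every index and look at how large they serialize to, since that state has to live in the masters' heap:

```python
# Rough sketch: measure how large all index mappings are when serialized.
# This is only a proxy for the heap the masters need, but it makes the
# "our mappings are pretty big" point concrete.
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder, not one of our actual nodes

def total_mapping_size_bytes(es_url=ES_URL):
    """Return the size in bytes of the JSON mappings across all indexes."""
    with urllib.request.urlopen(es_url + "/_mapping") as resp:
        return len(resp.read())

if __name__ == "__main__":
    size_mb = total_mapping_size_bytes() / (1024 * 1024)
    print(f"all index mappings serialize to roughly {size_mb:.1f} MB")
```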
We have a few action items in our pipeline, some of which have already been completed, some of which we will complete in the near future:
At Close.io we're striving to be better every single day. We work towards a balance between having a stable, elastic infrastructure and having a team that can respond well to any kind of situation. That balance is of course delicate, but we will get there, and we will continue working towards giving our customers the best product on the market.