500 Errors
Incident Report for Close
Postmortem

Incident

On March 27 at 5:11am PT (12:11pm UTC), a faulty deployment finished, leaving some of the Close.io app servers in an inconsistent state. This caused the majority of users' requests to fail for the next 29 minutes, until 5:40am PT (12:40pm UTC).

Investigation

While investigating the issue, we realized two things:

  1. Some of the installed packages didn't play well with our monitoring agent.
  2. Due to a bug in the deployment process, our app servers could be left in an inconsistent state. This time it happened because our GitHub repository was unavailable (due to GitHub suffering from a DDoS attack), but it could happen for several other reasons, too.

Both of these issues resulted in our app servers responding to your requests with errors.
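To illustrate the second point, here is a minimal, hypothetical sketch (not our actual tooling) of a non-atomic, in-place deployment. Each step mutates the live server directly, so a failure part-way through, such as GitHub being unreachable, leaves the earlier steps applied and the later ones not:

    import subprocess

    def deploy_in_place(app_dir):
        # Hypothetical non-atomic deploy: every step changes the live app directory.
        steps = [
            ["git", "-C", app_dir, "pull"],                           # fails if GitHub is unreachable
            ["pip", "install", "-r", app_dir + "/requirements.txt"],  # may install conflicting packages
            ["systemctl", "restart", "app"],                          # restarts whatever state is on disk
        ]
        for step in steps:
            # If any step fails here, everything before it has already been
            # applied, leaving this server half-updated (an inconsistent state).
            subprocess.run(step, check=True)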

Resolution

We identified the issues and re-deployed, ensuring that all servers are in a consistent state and run non-conflicting versions of third-party packages.

What we're doing to prevent problems like this

We've already started working on fixing the issues that surfaced during this incident. Our main priority right now is to ensure that errors like this don't happen again. We will achieve that goal by 1) introducing a more robust staging environment, and 2) making our deployment process atomic and more resilient to external failures, such as our GitHub repository being unreachable.
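As a rough sketch of what "atomic" means here (illustrative only; the directory layout and tooling are assumptions, not our exact setup): each release is built in its own directory, and the server only starts using it when a single symlink is switched after every step has succeeded, so a failed step leaves the previously running release untouched.

    import os
    import subprocess
    import time

    def deploy_atomically(repo_url, releases_dir, current_link):
        # Build the new release in its own directory; nothing the app serves
        # is touched until every step below has succeeded.
        release = os.path.join(releases_dir, time.strftime("%Y%m%d%H%M%S"))
        subprocess.run(["git", "clone", "--depth", "1", repo_url, release], check=True)
        subprocess.run(["pip", "install", "--target", os.path.join(release, "vendor"),
                        "-r", os.path.join(release, "requirements.txt")], check=True)

        # Point a temporary symlink at the new release, then atomically rename it
        # over the "current" link. If anything above failed, we never get here and
        # the old release keeps serving traffic unchanged.
        tmp_link = current_link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(release, tmp_link)
        os.replace(tmp_link, current_link)

With a setup like this, the app only ever runs code reachable through the "current" symlink, so every server is either fully on the old release or fully on the new one.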

We truly apologize for the limited availability of Close.io during this incident.

Posted Mar 28, 2015 - 09:54 PDT

Resolved
Everything is back to normal now. Very sorry for the server errors. We'll post a postmortem soon.
Posted Mar 27, 2015 - 05:38 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 27, 2015 - 05:25 PDT