Application Unavailable
Incident Report for Close
Postmortem

On Friday, Apr 7, 2017, for approximately 30 minutes starting at 10:42am PDT, Close.io had an issue that prevented our app UI (web app & desktop apps) from loading properly. For users who already had the app (web or desktop) loaded on their screen, Close.io continued to function without issue. However, opening new windows/tabs of Close.io or refreshing the page did not work properly due to a JavaScript error. Our API was unaffected.

This incident started after a deployment of completely unrelated backend code went wrong. The deployment prior to that was a change to how JavaScript dependencies are loaded into Close.io, to help with our goal of significantly reducing page load times. This prior change also introduced caching of webpack assets between builds to reduce developer wait times and deployment times.

This prior deployment was tested in both development and staging environments and had been deployed to production for 3 hours without issue before the incident began. However, due to a flaw in the new webpack static asset caching between builds, the next (unrelated) deployment led to deploying a broken static asset manifest.

Once the broken manifest was deployed, we immediately spotted the issue and started working to roll back the problematic code. Within about 15 minutes some new loads of Close.io began working properly. Within about 30 minutes, all servers had been fully reverted to non-broken code and the incident was resolved.

We're extremely sorry for this outage and are doing everything we can to help make sure this type of problem doesn't occur again. These action items include:

  • More carefully testing multiple deployments whenever between-build caching is changed.
  • Continued work toward speeding up our deployment process so issues in production can be more quickly rectified once discovered.
  • Canceling deployment when a problem with our asset manifest is detected.
  • Introducing additional integration tests as part of our build & deployment process so that we can be more confident that a page is rendered properly without any JavaScript errors before we continue deployment.
  • Considering introducing automatic canary deployments so that all deployments are tested automatically on a subset of traffic (and can be canceled if something goes wrong) before being deployed more widely.
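
A minimal sketch of what the asset manifest check mentioned above could look like. It assumes the manifest is a flat JSON object mapping logical asset names to built filenames inside a single output directory; that format, the file names, and the function name find_manifest_problems are illustrative assumptions, not a description of the actual Close.io build.

```python
import json
from pathlib import Path


def find_manifest_problems(manifest_path, assets_dir):
    """Return a list of problems found in the manifest; an empty list means it passed.

    Assumed (hypothetical) manifest format: {"app.js": "app.abc123.js", ...}
    """
    try:
        manifest = json.loads(Path(manifest_path).read_text())
    except (OSError, ValueError) as exc:
        # Unreadable file or invalid JSON.
        return [f"cannot read manifest: {exc}"]

    if not isinstance(manifest, dict) or not manifest:
        return ["manifest is empty or not a JSON object"]

    problems = []
    for logical_name, built_name in manifest.items():
        if not isinstance(built_name, str):
            problems.append(f"{logical_name}: entry is not a filename")
            continue
        asset = Path(assets_dir) / built_name
        if not asset.is_file():
            problems.append(f"{logical_name}: {built_name} is missing")
        elif asset.stat().st_size == 0:
            problems.append(f"{logical_name}: {built_name} is empty")
    return problems
```

A deployment script could call this after the build step and stop with a non-zero exit code whenever the returned list is non-empty, so a manifest that points at missing or empty files never reaches production servers.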
Posted Apr 10, 2017 - 13:23 PDT

Resolved
This incident has been resolved.
Posted Apr 07, 2017 - 11:39 PDT
Monitoring
We have reverted a change made to the application. If you are having trouble accessing the application, please restart the Close.io application or hard refresh https://app.close.io if you are using the web app.
Posted Apr 07, 2017 - 11:09 PDT
Investigating
We're investigating an issue with accessing the application.
The engineering team is currently looking into resolving this as quickly as possible.
Posted Apr 07, 2017 - 10:43 PDT