The following is the incident report for the company-wide service outage that occurred between 00:10 UTC and 01:50 UTC on May 11, 2019.
A representative user report: "When I open my Master Tour desktop application, I'm told that I've been removed from my organization, and when I log back in my data is gone."
Around 00:10 UTC on May 11, 2019, error reports began alerting the development team to an issue. The reports arrived in blocks of 5,000 errors (meaning many users were hitting them at the same time), bouncing between "Error Opening DB: mysqli_connect(): (HY000/2002): Connection timed out" and "Error: Request failed with status code 412", both of which are atypical during normal operation. Upon investigation, it was revealed that the service's data points were heading toward a flatline. At this stage a company-wide alert was issued.
The issue began intermittently, causing confusion as to the root of the problem: some users could log in to my.eventric.com (the "portal") and see all of their information, and data in the Master Tour 3 client could be successfully updated, auto-synced, and then seen on the portal, but this was true only for some users and not others. Eventually, all users were experiencing the same errors; this was because some requests were successfully routing to the database while others could not. At this stage, with emails and phone calls beginning to pour in, it was decided to turn on maintenance mode, blocking all write traffic, potential client-side organization removal, and syncing so the issue could be properly investigated. The status page was updated accordingly.
It was soon discovered that the API server was returning HTTP 412 status codes while it had no connectivity to the database. A 412 signals to the desktop client that it no longer has permission to the organization it is trying to sync, and the client thus removes all local data belonging to that organization. It was posited that this was happening to users attempting to sync organization 0, the stock organization that everyone belongs to, and that this would cause all local information to be removed and render the client unusable. This fear was reinforced by screenshots arriving from users reporting in.
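The destructive client behavior described above can be sketched as follows. This is a hypothetical illustration, not Eventric's actual client code: the function name and the `local_store` structure are invented for the example. The point is that the client treats a 412 as proof of removal and deletes local data, with no way to distinguish a genuine revocation from a misbehaving server.

```python
def handle_sync_response(status_code: int, org_id: int, local_store: dict) -> None:
    """Hypothetical sketch of the desktop client's 412 handling.

    A 412 is taken as proof that the user was removed from the
    organization, so all local data for that org is deleted. During the
    outage the server returned 412 for connectivity failures too, which
    is what made the bug destructive.
    """
    if status_code == 412:
        # Wipe every locally cached record belonging to this organization.
        local_store.pop(org_id, None)
```

Because the deletion is unconditional on the client side, any server-side path that can emit a 412 for a non-permission reason translates directly into data loss for the user.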
Furthermore, as additional complaints informed support that users were seeing messages saying they had been removed from their organization, it was determined that 412 status codes were being returned for requests outside organization 0 and could affect any organization.
While the network was being investigated for issues, it was discovered that our service provider, Amazon's AWS, had reported a major outage with their DNS resolution, the most damaging effect being the routing to their auto-scaling RDS database instances, which Master Tour utilizes.
This was then verified against Eventric's internal network: a proper IP address would be returned for our database servers one second, and the next the lookup would return nothing or return foreign IP addresses.
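A flapping resolver like this can be confirmed with a simple repeated-lookup check. The sketch below illustrates the idea; the hostname is a placeholder, not Eventric's real database endpoint, and this is not the exact tooling used during the incident.

```python
import socket
import time

def poll_dns(hostname: str, attempts: int = 5, delay: float = 1.0) -> list:
    """Resolve a hostname repeatedly, recording each address or failure.

    A healthy endpoint returns the same (or an expected set of) addresses
    every time; a flapping resolver alternates between valid answers,
    failures, and unexpected addresses.
    """
    results = []
    for _ in range(attempts):
        try:
            results.append(socket.gethostbyname(hostname))
        except socket.gaierror as exc:
            results.append(f"FAILED: {exc}")
        time.sleep(delay)
    return results

if __name__ == "__main__":
    # "db.example.internal" is a placeholder hostname for illustration.
    for result in poll_dns("db.example.internal"):
        print(result)
```

Running a loop like this against the database endpoint during the incident would show valid answers interleaved with failures, matching the intermittent errors users were seeing.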
It was then determined that the DNS outage was still affecting the load balancer connecting the API servers to the read-only database instances. To correct this, the load-balanced solution was replaced with a single, powerful, dedicated read-only database instance until connectivity had solidified. A read-only instance at 4x the normal size was spun up and used as the direct connection for all API server instances from then on.
At this stage, the outage was confirmed patched by all of the Eventric staff who were nice enough to jump online on their Friday night to help test desktop syncing. Post-mortem support requests numbered between 100 and 200; the outage was felt only by those who had the desktop or mobile application open during the outage window.
|Time (UTC)|Event|
|---|---|
|May 11th 00:10|Initial discovery of an outage, reported by Bugsnag through the Slack channel|
|May 11th 00:32|Discovery of AWS' network issues and supporting evidence; various debugging of the API server and the network in an attempt to get connectivity restored|
|May 11th 01:50|The outage is patched and the status page updated. Tweets go out giving a simple explanation. Maintenance mode is turned off.|
The root cause was the AWS DNS outage, coupled with the API server mistaking a database connectivity issue for an organization removal.
The API server was patched to never send a 412 status code for the stock organization. A dedicated read-only instance at 4x normal capacity was spun up to weather the AWS outage. Transparent communication and maintenance windows were put to proper use, and an abundance of support help was lent by everyone in the company who saw the alerts.
Following the resolution of the AWS outage, the auto-scaling AWS endpoint has been reinstated to appropriately handle influxes of user load.
The API logic has been patched to never send a 412 status code on the stock organization data.
The API logic is currently being improved to prevent returning a 412 status code for any reason other than the user genuinely lacking permissions for the organization they are attempting to interact with.
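The intended guard can be sketched as follows. This is a hypothetical illustration of the rule described above, not the actual API code: the function, its parameters, and the status-selection logic are assumptions. The key property is that infrastructure failures surface as a retryable error rather than a 412, so the client never deletes local data over a connectivity blip.

```python
from http import HTTPStatus

STOCK_ORG_ID = 0  # the stock organization that every account belongs to

def sync_status_code(org_id: int, has_permission: bool, db_reachable: bool) -> int:
    """Return the HTTP status for a sync request (illustrative sketch).

    412 is reserved for a genuine loss of permission on a non-stock
    organization. Database connectivity failures return 503 instead,
    telling the client to retry later rather than wipe local data.
    """
    if not db_reachable:
        return HTTPStatus.SERVICE_UNAVAILABLE  # 503: infrastructure fault, retry
    if org_id != STOCK_ORG_ID and not has_permission:
        return HTTPStatus.PRECONDITION_FAILED  # 412: org access truly revoked
    return HTTPStatus.OK
```

Under this rule, the failure mode seen during the outage (no database connectivity) maps to 503 in every case, and a 412 can only be produced when the permission check itself ran and failed.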