On the morning of Monday, July 30th, 2018, starting at 10:25am PT, our database servers were running at 100% CPU due to a surge of incoming requests. As a result, access to the database was impaired for about 90 minutes, and things were not stable again until 12:00pm PT.
The underlying cause of the bottleneck was an inefficiently implemented API endpoint: the one used to fetch the list of “All projects” in the Routific app. We use a mapReduce call on our database to generate the projects list, an approach that does not scale well with large data sets.
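To illustrate why this kind of approach struggles as data grows, here is a minimal Python sketch. It is not the actual endpoint code: the in-memory record list, the field names (`project_id`, `stop_id`), and both function names are assumptions made purely for illustration. The point it shows is that a map-and-reduce pass touches every stored record on every request, so the work grows with total data volume rather than with the number of projects returned.

```python
from collections import defaultdict


def make_records(n_projects, stops_per_project):
    """Build a hypothetical in-memory stand-in for the stored records."""
    return [
        {"project_id": p, "stop_id": s}
        for p in range(n_projects)
        for s in range(stops_per_project)
    ]


def list_projects_map_reduce(records):
    """Derive the project list by mapping and reducing over every record.

    Returns the project list and the number of records scanned.
    """
    scanned = 0
    counts = defaultdict(int)
    for record in records:
        scanned += 1
        # Map step: emit (project_id, 1). Reduce step: sum per project.
        counts[record["project_id"]] += 1
    projects = [
        {"project_id": pid, "stop_count": count}
        for pid, count in sorted(counts.items())
    ]
    return projects, scanned
```

In this sketch, listing 3 projects with 4 records each scans 12 records, and listing the same 3 projects with 4,000 records each scans 12,000, even though the response is the same size.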
We simulated the July 30th scenario on our staging servers and were able to reproduce the problem. This gives us confidence that we have found the root cause.
On July 30th at 12:00pm PT, we archived large amounts of data for some of our largest customers (with their consent). This is only a temporary fix, because data can build up again.
As for a more permanent fix, the engineering team is refactoring this endpoint to make it more efficient and scalable.
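One common direction for this kind of refactor, sketched here only as an illustration and not as a description of the actual change being made, is to keep a small per-project summary up to date at write time and serve the list in pages. The `ProjectIndex` class, its method names, and the `offset`/`limit` parameters below are all hypothetical.

```python
class ProjectIndex:
    """Hypothetical per-project summary maintained on write."""

    def __init__(self):
        # Maps project_id to the number of stops recorded for it.
        self._stop_counts = {}

    def add_stop(self, project_id):
        """Update the summary when a record is written."""
        self._stop_counts[project_id] = self._stop_counts.get(project_id, 0) + 1

    def list_projects(self, offset=0, limit=20):
        """Return one page of projects without scanning the raw records.

        The sort is over project ids only, so the cost depends on the
        number of projects, not on the number of underlying records.
        """
        page_ids = sorted(self._stop_counts)[offset:offset + limit]
        return [
            {"project_id": pid, "stop_count": self._stop_counts[pid]}
            for pid in page_ids
        ]
```

With this shape, the read path only touches the summaries, at the price of slightly more work on each write and the need to keep the summary consistent with the underlying data.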