Routific API outage
Incident Report for Routific
Postmortem

Overview

On the morning of Monday July 30th, 2018 starting at 10:25 a.m. PT our database servers were running at 100% CPU due to a large flurry of incoming requests. Because of this, access to the database was inhibited for about 90 minutes. It wasn’t until 12:00pm PT that things were stable again.

Root Causes

The underlying problem that caused the bottleneck was an inefficiently implemented API endpoint. This endpoint is used fetch the list of “All projects” in the Routific app. We use a mapReduce call on our database to generate the projects list; an approach that does not scale well with large data sets.

We have simulated the scenario of July 30th on our staging servers and we were able to reproduce the problem. This gives us confidence that we have indeed found the root cause.

Resolution

On July 30th at 12:00pm PT we archived large amounts of data for some of our largest customers (with their consent). This is a temporary fix, because data can build up again.

As for a more permanent fix, the engineering team is working on a refactor of this endpoint, so it will be scalable and more efficient.

Posted Aug 03, 2018 - 13:39 PDT

Resolved
We've monitored for the last hour, and things are stable again.

So sorry for the interruption :(
Posted Jul 30, 2018 - 14:50 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 30, 2018 - 13:22 PDT
Identified
We found roughly the cause, cleared up the DB read/write pipes, but still investigating the exact details to pinpoint what caused it.
Posted Jul 30, 2018 - 12:18 PDT
Investigating
The fix was temporary; the problem came back again. Continuing the investigation.
Posted Jul 30, 2018 - 11:13 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 30, 2018 - 11:04 PDT
Update
We have finished the roll back. Services are back online. We are continuing to monitor the systems closely.
Posted Jul 30, 2018 - 11:04 PDT
Investigating
The Routific API is intermittently down. We just made a new deployment that caused it, and we are in the process of reverting.
Posted Jul 30, 2018 - 10:52 PDT
This incident affected: Routific API.