Route optimization service was partially degraded

Incident Report for Routific

Postmortem

On May 17 from 08:34 a.m. to 11:31 a.m. and on May 20 from 03:12 a.m. to 07:20 a.m., part of Routific’s route optimization service was degraded. It has been stable since then. We apologize for the lack of communication while the problem was occurring.

What happened on May 17?

On May 17 from 08:34 a.m. to 11:31 a.m., many route optimization jobs failed to process because one of our instances became unavailable. On May 14, we had shipped an improvement that finds more efficient route solutions for small problems. This increased the demand for computing resources and caused one of our two instances to fail on May 17 at 08:34 a.m. The remaining healthy instance was not able to respond to all job requests on time. At 11:31 a.m., we identified the issue, removed the unresponsive instance and reverted our deployment.

What happened on May 20?

On May 20 from 03:12 a.m. to 07:20 a.m., job processing took a long time to complete because of a large number of incoming job requests. Our auto-scaling policy also failed to trigger, so no new instances were added automatically. At 07:20 a.m., we identified the issue and manually added new instances to handle the load. We will keep these new instances running until we permanently fix our auto-scaling policy.
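
For illustration only, a backlog-based scaling rule of the kind described above could be sketched as follows. This is not Routific's actual policy: the function names, the per-instance throughput and the instance limits are all hypothetical, chosen only to show how a queue backlog can be turned into a target instance count and into an alert when the running fleet falls short of that target.

```python
import math


def desired_instances(queued_jobs, jobs_per_instance,
                      min_instances=2, max_instances=10):
    """Return the instance count needed for the current backlog.

    All parameters are hypothetical: jobs_per_instance is an assumed
    per-instance capacity, and the min/max bounds are placeholder limits.
    """
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    # Round up so a partially filled instance still counts as one instance.
    needed = math.ceil(queued_jobs / jobs_per_instance)
    # Clamp the result to the allowed fleet size.
    return max(min_instances, min(max_instances, needed))


def scaling_alert(current_instances, queued_jobs, jobs_per_instance):
    """Return True when fewer instances are running than the backlog needs.

    A monitoring check like this can catch the case where the scaling
    policy itself did not trigger.
    """
    return current_instances < desired_instances(queued_jobs, jobs_per_instance)
```

With these assumed numbers, a backlog of 450 jobs at 50 jobs per instance calls for 9 instances, so a fleet of 2 would raise the alert.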

What are we doing about it?

Over the next quarter, our team will be dedicated to improving our auto-scaling and adding new metrics to our monitoring system. We are also working on moving our route optimization system onto new infrastructure to improve its reliability, availability and recoverability. Additionally, we are committed to improving how we communicate with our customers about downtime incidents.

Posted May 24, 2019 - 14:24 PDT

Resolved

On May 17 from 08:34 a.m. to 11:31 a.m. and on May 20 from 03:12 a.m. to 07:20 a.m., part of Routific’s route optimization service was degraded. It has been stable since then.

Posted May 20, 2019 - 11:31 PDT