Degraded Performance on optimization stack
Incident Report for Routific
Postmortem

On November 4th, from 6:20 p.m. to 7:45 p.m. PST, Routific's route optimization and SaaS services were partially degraded. 10% of optimization traffic was affected. The cause has been identified and fixed, and our system is now stable.

We're so sorry for all the stress that we've caused you, as we fully appreciate how important Routific's availability is to our customers.

What happened?

During the incident, the infrastructure that runs Routific's optimization engine became unstable. Specifically, one of the many instances that process our optimization requests became unresponsive and was not removed automatically from the system. Users might have experienced inconsistent and unrecoverable job failures, as well as an increased number of retries on the /vrp, /vrp-long, /pdp, and /pdp-long endpoints.

What are we doing about it?

Our team is committed to improving the reliability and stability of our systems. To prevent this incident from recurring, we will improve our infrastructure provisioning and set up better tooling and protocols to detect and mitigate infrastructure instability should it arise.

Posted Nov 07, 2019 - 16:33 PST

Resolved
The issue has been resolved. We will follow up with a postmortem about the incident soon. So sorry about the trouble!
Posted Nov 04, 2019 - 18:30 PST