On September 19th, Routific was unavailable for 84 minutes, between 12:39 p.m. and 2:05 p.m. Pacific Time. We have identified the cause, and our systems have been stable since.
What happened?
The messaging queue system that our geocoding service depends on went down, which in turn made our platform unavailable. We had not configured the messaging queue to be highly available, so a single failed instance brought down the entire system.
Resolution
We immediately spun up a new messaging queue to replace the old service. We are still working closely with AWS support to investigate why the messaging queue failed.
What are we doing about it?
We are now moving our messaging queue to a managed third-party queueing service to ensure future scalability and high availability. In addition, our engineering team will work on decoupling our system so that a failure in one part will not affect the rest of our platform. We will also improve our failover processes to ensure uptime.
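
For readers curious what this kind of decoupling can look like, one common pattern is a circuit breaker: after repeated failures, callers stop waiting on the broken dependency and fall back to a degraded response until it recovers. The Python sketch below is illustrative only and is not Routific's implementation. `CircuitBreaker`, `geocode_via_queue`, and the thresholds are hypothetical names and values chosen for the example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, fallback):
        # While the breaker is open, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def geocode_via_queue(address):
    # Hypothetical stand-in for a queue-backed geocoding call during an outage.
    raise ConnectionError("messaging queue unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
# The rest of the request continues with a degraded result instead of failing.
coordinates = breaker.call(lambda: geocode_via_queue("123 Main St"), lambda: None)
```

With a wrapper like this, an outage in the queue degrades one feature rather than making the whole platform unavailable.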