Dear valued Routific customers,
This morning at 5:47 a.m. PT, Routific experienced an outage that lasted for 48 minutes. During this time, most of Routific’s services were completely unavailable. The geocoder service was partially unavailable for 90 minutes before it was fully recovered.
First and foremost, we are sorry we failed you. We understand that any downtime can cause major disruptions to your business. This incident highlighted a few parts of our infrastructure that lacked resilience, and our team is working hard to shore them up.
Here’s what happened in more detail:
At 5:47 a.m. PT, due to a yet-unknown DNS issue at mLab, we lost connection to our entire database cluster. Even the failover database was unavailable. This issue lasted for 37 seconds. We contacted mLab support and they acknowledged that multiple mLab customers were affected at exactly the same time. They have raised the issue with AWS.
How did a 37-second disappearance of our database turn into a 48-minute downtime at Routific?
This is where our stack failed. Our workers rebooted automatically when the connection with the database was severed, in an attempt to reconnect. We expected this to happen, but the problem was that mLab was still unreachable when the workers came back online. Each worker tried to reconnect, failed again during initialization – and was then stuck doing nothing.
While we have fail-safes in case the connection breaks, we didn’t have fail-safes in case the initialization of the connection failed 🤦.
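For illustration, the missing fail-safe amounts to retrying the connection *initialization* itself with backoff, not just reconnecting after an established connection drops. Here is a minimal sketch in Python; the `connect` argument is a hypothetical zero-argument initializer that raises on failure (this is not our production code, just the pattern):

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.1):
    """Retry a connection initializer with exponential backoff.

    `connect` is a hypothetical function that raises ConnectionError
    on failure; the real fix wraps the database driver's init call.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of retries: surface the error instead of
                # silently sitting idle.
                raise
            # Back off before the next try so we don't hammer a
            # provider that is still recovering.
            time.sleep(base_delay * (2 ** attempt))
```

With five attempts and exponential backoff, a 37-second blip is comfortably outlived instead of leaving the worker permanently idle.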
All we needed to do was to trigger a manual reboot of the workers that were affected, and the jobs started to flow again.
So why, then, was the geocoder worker still partially down for a total of 90 minutes? This could only have happened on Friday the 13th:
A few minutes before our mLab cluster disappeared from the grid, there was a spike in inbound traffic at Routific, which triggered our auto-scaler to spin up a few more instances. It just so happened that one of these instances finished deploying exactly during the 37-second window when mLab was unreachable.
And because of the aforementioned lack of retry upon initializing the connection, this worker – one out of the four workers in the microservice cluster – basically went on strike and sat idle until we discovered it 90 minutes later.
Because the load balancer works in a round-robin fashion, only one out of four requests to this service timed out, making it harder to notice and track down.
Here, we identified another crack in our infrastructure: the worker was still consuming messages from the message queue, but it wasn’t telling RabbitMQ that things were failing – no ack, nor nack – so the slack never got picked up by the other workers.
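Conceptually, the fix is to make the consumer always report an outcome to the broker: ack on success, nack with requeue on failure, so that another worker picks up the message. Below is a minimal sketch assuming a pika-style `basic_ack`/`basic_nack` interface; the stub channel is purely illustrative and stands in for a real RabbitMQ channel:

```python
def handle_delivery(channel, delivery_tag, body, process):
    """Consume one message, always telling the broker the outcome.

    The bug: on failure we neither ack'd nor nack'd, so the message
    stayed reserved by the dead worker and no one else could take it.
    """
    try:
        process(body)
    except Exception:
        # Failure: return the message to the queue so another
        # worker can pick up the slack.
        channel.basic_nack(delivery_tag, requeue=True)
    else:
        # Success: confirm so the broker can discard the message.
        channel.basic_ack(delivery_tag)

class StubChannel:
    """Illustrative stand-in for a RabbitMQ channel (not pika itself)."""
    def __init__(self):
        self.acked, self.nacked = [], []
    def basic_ack(self, delivery_tag):
        self.acked.append(delivery_tag)
    def basic_nack(self, delivery_tag, requeue=True):
        self.nacked.append(delivery_tag)
```

The key design point is the `try`/`except`/`else`: every delivery exits through exactly one of the two branches, so the broker is never left waiting on a message a dead worker has claimed.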
When we finally discovered that this worker was slacking off, we asked him nicely to get back to work, and no more balls have been dropped since.
tl;dr: while the 37-second blip in the internet seems like an isolated and unavoidable occurrence, our infrastructure should have been able to recover automatically once those 37 seconds were up.
Our team is working hard to patch the holes that we uncovered, so that the next time this happens, it should only be a 37-second outage.
We are also going to review our monitoring and escalation processes, because we should have been able to react more swiftly to the situation – and communicate sooner with our customers.
I never considered myself a superstitious person when it comes to bad luck… 😥
Marc – Founder & CEO