Starting around 9:45 PT on April 6th, the IronWorker API started having issues where requests would not receive a response and would time out. The issues were intermittent over the next few days: short outages followed by hours of normal operation. The cause was a couple of infrequently used queries that weren't using indexes; more on that below. The issue was finally resolved around 2:45 PT on April 8th.
Note: this only affected the IronWorker API. IronMQ, IronCache, the IronWorker scheduler, and the IronWorker backend were not affected.
Why did it take so long to resolve?
The hardest part about resolving this was that the outages were very brief, 5-10 minutes, and occurred at random times with hours in between. The first time it happened, we thought it might have been one of those random issues that happen once in a blue moon, and after reviewing all our monitoring tools, nothing seemed wrong. The second time it happened, a couple of hours later, we knew something was wrong, so we dug in deeper. The team worked night and day to resolve this, but because the outages were short with hours of uptime in between, every attempted fix took hours to evaluate. We also weren't able to reproduce the problem in our staging environment.
Steps we took to resolve the issue
Here are the steps we took:
- Reviewed our monitoring tools
  - Nothing seemed out of the ordinary here; our databases looked fine and all CPU/disk/memory metrics looked fine too.
- Reviewed code
  - We reviewed our code. A couple of days before, on April 4th, we had pushed out a new release of the API with a very minor change (literally one line of code).
  - We reviewed the code of our dependencies, going through it commit by commit on GitHub to see if anything major had changed.
- Rolled back recent changes
  - We rolled back our code. This was unlikely to fix it given how minor the change was and how thoroughly it had been tested on staging before rollout, but you can never be too sure.
  - We rolled back our AMI to a previous AMI we'd used without issue.
  - We reviewed how the IronWorker backend (which runs the actual jobs) was interacting with the API and how that had changed recently.
  - We even rolled this back.
- Dug into database logs
  - We checked our query logs and found two queries that were taking seconds to complete (which is waaaay too long). Bingo.
  - These queries were locking up our database.
  - We created indexes for these two queries.
  - Problem solved.
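As a rough illustration of what that last step looks like (the table and index names here are made up; Iron.io hasn't published its schema, and this sketch uses SQLite rather than their actual database), a query with no supporting index shows up in the query plan as a full table scan, and adding an index flips it to an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, project_id TEXT, status TEXT)")

query = "SELECT * FROM tasks WHERE project_id = ? AND status = ?"

def plan_detail(sql):
    # EXPLAIN QUERY PLAN returns rows whose last column is a
    # human-readable description of the access strategy.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("p1", "queued")).fetchall()
    return rows[0][-1]

before = plan_detail(query)  # e.g. "SCAN tasks" -- a full table scan
conn.execute("CREATE INDEX idx_tasks_project_status ON tasks (project_id, status)")
after = plan_detail(query)   # e.g. "SEARCH tasks USING INDEX idx_tasks_project_status ..."
print(before)
print(after)
```

On a small table the scan is invisible; under high load, that same scan can hold locks for seconds at a time, which matches the symptom described above.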
After creating those indexes, the issue went away. It turns out these queries were rarely used, which is why we hadn't seen this issue in the 2+ years that IronWorker has been running. But when rarely used, unindexed queries suddenly start running in a high-load system, you have yourself a good old-fashioned problem. As I always say, 90% of the issues you're going to have with any application or service are going to be in your database. I should have listened to myself over the past few days and we probably would have resolved this sooner.
What We’re Doing About It
We always learn from our mistakes, and we take outages very, very seriously. We have already added what we've learned from this incident to our emergency response plan, and we are going to start automating that plan wherever we can to remove the human element and catch these things faster and with more precision. We will also be doing a full code review to ensure all queries are covered by indexes.
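That kind of review can itself be automated. One way (a sketch using SQLite and a made-up query list, not Iron.io's actual tooling) is to run every known query through the planner and flag any whose plan falls back to a full table scan:

```python
import sqlite3

def queries_missing_indexes(conn, queries):
    """Return the queries whose plan includes a full table scan."""
    flagged = []
    for sql, params in queries:
        plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
        # The last column of each plan row is a human-readable detail string;
        # "SCAN" without "USING ... INDEX" indicates a full table scan.
        if any("SCAN" in row[-1] and "USING" not in row[-1] for row in plan):
            flagged.append(sql)
    return flagged

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, queue TEXT, created_at INTEGER)")
conn.execute("CREATE INDEX idx_jobs_queue ON jobs (queue)")

checks = [
    ("SELECT * FROM jobs WHERE queue = ?", ("default",)),   # covered by idx_jobs_queue
    ("SELECT * FROM jobs WHERE created_at < ?", (12345,)),  # no supporting index
]
missing = queries_missing_indexes(conn, checks)
print(missing)
```

A check like this can run in CI against a schema dump, so a new query that lacks an index fails the build instead of surfacing as an intermittent production outage.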
Thanks to a few suggestions, we are also committed to better informing customers in real time during incidents, not least through the rollout of our brand new status.iron.io site. You can choose to be updated by SMS or email for all or selected events.