April 25 outage postmortem

By Alexey Gavrilov on April 29, 2012

Zello has been growing like crazy recently, with the number of active users increasing 23% on average each week since the iPhone release. April 25 started out great, setting a new record of 222k people online at the same time, with all systems running smoothly.

We first noticed the problem at 9:30 AM Pacific: signing in to the network was taking a long time and the number of online users started to drop. A quick check confirmed that some of the supernodes handling user connections were down. Normally this wouldn’t be a problem because if one server dies, its users automatically reconnect to a different one. That wasn’t happening though: the users couldn’t log back in, creating a huge backlog, traffic and CPU load on the login server. At Zello we use membase to store data that requires low-latency access. The failed server was also part of a membase cluster we had recently created to handle the increased load. This failure, however, had a catastrophic effect, making membase unresponsive and preventing new users from signing in.
