Zello has been growing like crazy recently -- with the number of active users increasing 23% on average each week since the iPhone release. April 25 started out great, setting a new record of 222k people online at the same time, with all systems running smoothly.
We first noticed the problem at 9:30 AM Pacific -- signing into the network was taking a long time and the number of online users started to drop. A quick check confirmed that some of the supernodes handling user connections were down. Normally this wouldn't be a problem: if one server dies, its users automatically reconnect to a different one. That wasn't happening though -- users couldn't re-login, creating a huge backlog, traffic, and CPU load on the login server. At Zello we use membase to store data that requires low-latency access. The failed server was also part of a membase cluster we had recently created to handle the increased load. This failure had a catastrophic effect, making membase unresponsive and preventing new users from signing in.
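A mass-reconnect pile-up like this is the classic thundering-herd problem. One common mitigation -- not necessarily what Zello's clients do; the function and parameter names here are purely illustrative -- is client-side exponential backoff with full jitter, so a dead server's users spread their reconnect attempts over time instead of hammering the login server simultaneously:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: the wait before retry
    number `attempt` is uniformly random between 0 and an
    exponentially growing ceiling, capped at `cap` seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# Early retries happen quickly; later ones spread over up to a minute.
delays = [reconnect_delay(n) for n in range(8)]
```

The jitter is the important part: plain exponential backoff still synchronizes all clients that disconnected at the same instant.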
At that point the service was still operational for most clients. We started rebooting the failed server and removed it from the membase cluster. That didn't work though: the amount of RAM available to membase dropped below the threshold at which it starts evicting items from memory to disk, again slowing everything down. I think we also ran into a membase bug here, because the number of objects stored doubled for no apparent reason in just a few minutes, further pushing the limits of the system.
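To see why removing a node hurt: membase keeps working-set items in RAM and starts ejecting values to disk once memory use crosses a high-water mark, at which point reads get much slower. A rough back-of-the-envelope sketch (the node sizes, item counts, and water mark here are invented for illustration, not Zello's actual numbers):

```python
def will_evict(items, avg_item_bytes, node_ram_bytes, nodes, high_water=0.75):
    """Return True if the working set exceeds the high-water mark
    of the cluster's combined RAM, triggering ejection to disk."""
    usable = node_ram_bytes * nodes * high_water
    return items * avg_item_bytes > usable

ram = 8 * 2**30  # 8 GB per node (illustrative)

# A working set that fits comfortably on 3 nodes crosses the
# eviction threshold as soon as one node drops out.
evicts_with_3 = will_evict(9_000_000, 2_000, ram, 3)  # False
evicts_with_2 = will_evict(9_000_000, 2_000, ram, 2)  # True
```

The lesson is that a cache cluster sized with little headroom can't lose a node without changing its performance characteristics entirely.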
Membase being unresponsive for a prolonged time caused subsequent failures of more supernodes, ending with a system where no users were online, everyone was trying to reconnect, and membase was timing out on most requests. No one could connect, not even the supernodes.
To restore the system we decided to use the firewall to block all users and allow only local traffic, letting the supernodes connect. It worked, and soon we had the system operational, albeit with no users connected. Then we started enabling users gradually, 10 IP blocks at a time. All looked great initially, but when about half of the users were back online we noticed the system was running slow again. Server logs showed a large number of TIMEOUT and MEMORY ALLOCATION FAILED errors from membase. The errors were coming from the bucket storing transient values, so the simplest solution was to flush it, recreate it, and restart the network (meaning we had to block all IPs and re-enable them gradually again). Unfortunately it didn't work -- the same errors kept popping up in the server log as if nothing had changed. We tried to re-add the server to the cluster, but the rebalance would take forever and fail at 75%.
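The gradual re-admission step is easy to script. A minimal sketch of the batching logic (the /16 ranges and batch size are illustrative; in practice each batch would be turned into firewall allow rules, e.g. iptables, rather than collected in a list):

```python
def unblock_in_batches(ip_blocks, batch_size=10):
    """Yield successive batches of IP ranges to allow through the
    firewall, so load ramps up gradually instead of all at once."""
    for i in range(0, len(ip_blocks), batch_size):
        yield ip_blocks[i:i + batch_size]

# Illustrative /16 ranges; real ones would come from your user data.
blocks = [f"10.{n}.0.0/16" for n in range(25)]
batches = list(unblock_in_batches(blocks))
# 25 ranges split into batches of 10, 10, and 5
```

Between batches you would watch load and error rates, and stop ramping the moment the backend starts to struggle -- which is exactly how the slowdown at the halfway point was caught.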
As we found out later, the problem was caused by moxi, which is part of the membase server. Restarting moxi fixed the server errors, and we were finally able to bring the network online -- after 20 hours of continuous work, thousands of support e-mails, and hundreds of 1-star ratings from enraged users.
Here are some takeaways for me:
1. Sharing hardware between system components can have dangerous side effects when you deal with high load.
2. Updating server components (membase in our case) can be painful but necessary. 4 hours of scheduled maintenance is better than a 20-hour outage.
3. If things go really bad, use the firewall to block everyone, restore system health, then gradually let users in by unblocking specific network ranges.
4. Make sure to identify and fix the root cause before doing #3, or you risk having to do it all over again.