Sunday, November 16, 2008

Catastrophic Server Failure

Just putting out the word for those who may have noticed: Flickspin was down for about 9 hours starting at 7:35pm Saturday, Nov 15. The failure was due to a script that went horribly wrong and seemingly blindsided MySQL, the DBMS behind the site.

It turned out to be a Linux permission issue, but I was none the wiser and began tearing at the OS to try to bring it back, to no avail. I tried restoring a backup EC2 instance, only to find that my ssh keys had changed since I created the backup, so I could no longer gain access to it.

In the end I rebuilt an EC2 instance, reinstalled all the applications, and restored the code and the database.

Fortunately no data was lost, and I have rsync and MySQL replication to thank for that. Having the site down even briefly, let alone for hours, is not something I take lightly. I put a lot of work into avoiding this kind of failure, and it just goes to show that you don't really know your backup system until you need it. And more often than not, some part of it is going to fail.

I'm happy to have the site back up and running, but things are still a bit fragile until I can spend some time on some smaller issues tomorrow.

Right now though, I think I need some sleep. It's now 5am.

Good night guys.
