Server outage details.



On Friday around 3pm, I started upgrading various packages on the server. The distro in use at the time was Fedora Core 2, which had been out of the whole update scene for quite a while - something I wanted to correct. I had started installing a few packages that would have little impact on operations when everything stopped. As only my existing ssh session was responding (no web, no new ssh sessions, etc.), I told the box to reboot. At that point the server kernel panicked and refused to do anything. I called someone onsite to hard reset the server, and they watched the screen as it booted: the kernel panicked on every reboot with an error in ext3.ko.

This is where things get fun. It seems something (still unknown at this point) corrupted around 91MB of files on the filesystem. One of these was ext3.ko, which made the box unbootable. At this point I also redelegated the multiple domains hosted on the server to another primary nameserver, to stop the DNS issues caused by the primary nameserver being offline.

I then pulled the server out and brought it home to work on, and figured that as it wouldn't boot at all, I'd upgrade it to the latest Fedora Core 4 packages as I went. It then turned out that the journal was also corrupt on the ext3 filesystems, and as it was the root drive, the system would not let me fsck it without major hassles - and 91MB worth of lost data. So I booted off the FC 'panic' DVD, copied as much data as possible off the system, reformatted the whole thing and installed FC4. The rest of the install went smoothly. The fairly recent tape backup (done on the 19/2) restored without a hitch, and 95% of things were back to normal. This took from 7pm until around 4:30am Friday night/Saturday morning.

I had to work at 9am, so I did my 9->5 shift and then came home to work more on the server. As most of the data was repaired, I spent my time punishing the server to see if I could make it crash again - with no luck. Today, the server went with me to work (for another 9am -> 5pm shift), where I tried harder to make it crash (no success!), and finally tonight at around 7:30pm the server was put back online in its new home in Collins St.

Sorry to all for the outage. It wasn't something planned, however the backup system worked flawlessly to get the machine fully rebuilt and back online in a shade over 48 hours with minimal data lost. I'm currently working on improving the backups to a nightly run to a remote location, to minimise potential data loss to under a day - however this is currently in the planning/testing stage.
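For the curious, the sort of thing I'm testing looks roughly like the sketch below: a small script run nightly from cron that pushes the important directories to an offsite box over ssh using rsync. The hostname and paths here are placeholders rather than the real setup, and the final version will no doubt change once testing is done.

#!/usr/bin/env python
# Sketch of a nightly offsite backup, run from cron, e.g.:
#   30 2 * * * /usr/local/bin/nightly_backup.py
# The remote host and source paths are placeholders, not the live config.

import subprocess
import sys
from datetime import date

REMOTE = "backup@offsite.example.com:/backups/server/"  # hypothetical offsite host
SOURCES = ["/etc", "/home", "/var/www", "/var/named"]   # directories worth keeping
LOGFILE = "/var/log/nightly-backup-%s.log" % date.today().isoformat()

def main():
    status = 0
    with open(LOGFILE, "w") as log:
        for src in SOURCES:
            # rsync over ssh: only changed files cross the wire, so a
            # nightly run stays cheap after the first full copy.
            rc = subprocess.call(["rsync", "-az", "--delete", src, REMOTE],
                                 stdout=log, stderr=log)
            if rc != 0:
                log.write("rsync failed for %s (exit %d)\n" % (src, rc))
                status = 1
    return status

if __name__ == "__main__":
    sys.exit(main())

The appeal of rsync here is its delta transfer: after the first full copy, only changed files go over the link each night, so the backup window stays small even to a remote site.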
