Linux has a lovely software RAID feature set with a ton of options and levels for just about any situation, but the thing most people use it for is keeping their data when a hard disk dies (not if, when). With the newer tools around these days, a lot of the documentation on how to check RAID arrays is out of date – and one of the worst things in the world is when you figure “it doesn’t matter that the drive died”, whack in another clean disk and SURPRISE! you have a second faulty disk!
So, how do you minimise the impact of failures?
1. Look at the SMART tools. Take note of the attribute values and get the drives to self-test on a regular basis:
smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
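Beyond turning SMART on, you can schedule periodic self-tests and get notified when a drive starts reporting problems. A minimal sketch of a weekly cron script, assuming the same /dev/hda device as above (adjust for your system, e.g. /dev/sda) and that smartmontools and local mail delivery are set up:

```shell
#!/bin/sh
# Hypothetical weekly cron job: start a long SMART self-test and
# mail root if the drive's overall health check no longer passes.
DISK=/dev/hda
if [ -e "$DISK" ]; then
    # Kick off a long (thorough) offline self-test in the background.
    smartctl -t long "$DISK"
    # -H prints the overall health assessment; alert if it isn't PASSED.
    if ! smartctl -H "$DISK" | grep -q PASSED; then
        echo "SMART health check failed on $DISK" | mail -s "SMART alert" root
    fi
fi
```

The smartd daemon from smartmontools can do the same job with a config file entry, if you'd rather not roll your own script.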
2. Scrub your RAID array on a semi-regular basis. This forces md to read every block in the array and verify that everything is consistent:
echo check > /sys/block/md0/md/sync_action
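To make scrubbing regular, a small cron-driven script can report any mismatches left over from the previous run before kicking off a new check. A sketch, assuming the array is md0:

```shell
#!/bin/sh
# Hypothetical monthly scrub script for a md0 array.
MD=/sys/block/md0/md
if [ -w "$MD/sync_action" ]; then
    # mismatch_cnt holds the number of inconsistent sectors found by
    # the *last* check/repair pass - non-zero means trouble.
    MISMATCHES=$(cat "$MD/mismatch_cnt")
    if [ "$MISMATCHES" -gt 0 ]; then
        echo "md0: $MISMATCHES mismatched sectors found on last scrub" >&2
    fi
    # Start a fresh read-and-verify pass.
    echo check > "$MD/sync_action"
fi
```

Many distributions ship a similar job already (e.g. a monthly mdadm checkarray cron script), so check before adding your own.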
3. Keep another backup. If your data is important, don’t rely on just a RAID array. Think about what happens if the whole machine dies: say a power supply failure takes out your hard disks – your RAID is now useless. Invest in a good quality tape drive and have a regular backup schedule.
Nothing is 100% foolproof, but a bit of thought before a failure can save you hours, sometimes days, of stress and headaches. The server that this site is hosted on recently had a RAID1 fail. Most data was recoverable, however the system required 2 new HDDs. A nightly rsync run from this machine to another offsite system kept the recovery time to 2 hours + data copying time. Very little was lost (I think we lost maybe 5 mailing list messages from the archives).
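For reference, a nightly rsync job like the one above can be as simple as a couple of crontab entries. The host name and paths here are placeholders, not the actual setup:

```shell
# /etc/crontab fragment: push important data to an offsite box each
# night. backup.example.com and the paths are hypothetical examples.
# -a preserves permissions/times; --delete mirrors removals too.
0 3 * * * root rsync -a --delete /home/ backup.example.com:/backups/home/
30 3 * * * root rsync -a --delete /var/mail/ backup.example.com:/backups/mail/
```

Note that --delete means a mistaken deletion propagates to the backup overnight, so pair this with snapshots or rotation if you can.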
Oh, and if you need to repair your RAID array at any time (repair works like check, but also rewrites any mismatched blocks it finds), try using:
echo repair > /sys/block/md0/md/sync_action
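You can watch the progress of a check or repair while it runs. A quick sketch, again assuming the array is md0:

```shell
#!/bin/sh
# Watch a running check/repair on a hypothetical md0 array.
MD=/sys/block/md0/md
if [ -r "$MD/sync_action" ]; then
    # Reports "check", "repair", or "idle" once the pass has finished.
    cat "$MD/sync_action"
    # /proc/mdstat shows a progress bar and ETA while a pass is running.
    grep -A 2 '^md0' /proc/mdstat
fi
```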