Backups and Filesystem Deduplication


I had quite a few goals I wanted to achieve:

  1. Always be able to perform a 'full' restore of any system.

  2. Only transfer incremental changes on a daily basis.

  3. Support compression and optionally encryption during file transfer.

  4. Be easy to maintain.

I looked at Bacula, but I ran into a number of issues - mainly that all backups are job based (rather than file based), which means that when a job expires, copies of the files it contained no longer exist on the backup system until another backup is run. This can be hacked around by doing VirtualFull backups, however that is rather nasty imho.

I ended up writing a script that uses the standard rsync tool to sync remote systems to a local storage area, keeping hard-linked copies of each system that are rotated nightly. This means I have $hostname/0, which is the latest FULL copy of the system, and $hostname/1 through $hostname/7, which contain the last 7 days of backups. The advantage of using hard links is that an unchanged file is simply linked in place, so no extra space is used (7 copies of the same file take the same space as 1 copy!). The end result is that I can back up multiple systems over multiple days in very little disk space. I coupled this with forced compression on a BTRFS filesystem, which saves roughly another 15% of space.
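
Stripped right down, the rotate-and-sync idea looks something like this - a minimal sketch only (not the actual rsync_backup script), with made-up host names and paths, and assuming key-based root ssh from the backup box to each host:

#!/usr/bin/perl
# Minimal sketch of the nightly rotate-and-sync idea - not the real rsync_backup
# script. Host names and paths are made up.
use strict;
use warnings;
use File::Path qw(make_path remove_tree);

my $backup_root = '/backup';            # hypothetical storage area
my @hosts       = ('web01', 'db01');    # hypothetical host list
my $days        = 7;

for my $host (@hosts) {
    my $dir = "$backup_root/$host";
    make_path($dir) unless -d $dir;

    # Age the snapshots: drop day 7, then shift 6->7 ... 0->1.
    remove_tree("$dir/$days") if -d "$dir/$days";
    for (my $i = $days - 1; $i >= 0; $i--) {
        rename("$dir/$i", "$dir/" . ($i + 1)) if -d "$dir/$i";
    }

    # Recreate 0 as a hard-linked copy of yesterday, then let rsync update it.
    # Unchanged files stay as hard links, so they cost no extra space.
    system('cp', '-al', "$dir/1", "$dir/0") if -d "$dir/1";
    system('rsync', '-aHz', '--delete', '--numeric-ids',
           "root\@$host:/", "$dir/0/");
}

rsync's --link-dest option can do the same job as the cp -al step in a single pass.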

Now the big challenge. The majority of the systems are RHEL (well, Scientific Linux to be exact), which means there are MANY duplicate files - kernel files, libraries and so on - that the backup scheme can't automatically hard link in place. While BTRFS promises a native deduplication process, it currently seems to be VERY slow (although when it does work it should be nice). I tried quite a few different dedupe tools, but nothing really worked well for me... so I wrote my own.

You can restrict an ssh key so that it can only be used for rsync access by using the rrsync script, usually distributed with the rsync package.
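
For example, on each machine being backed up, the backup server's public key can be tied to rrsync in root's ~/.ssh/authorized_keys with something along these lines (the exact path to rrsync varies by distribution, and the key itself is obviously a placeholder):

command="/usr/share/doc/rsync/support/rrsync -ro /",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-rsa AAAA...backup-key root@backupserver

The -ro flag makes the key read-only, so it can only be used to read files from the client, not write to it.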

There is one key part to this though - we want to make sure the permissions of the files on each system we back up are recorded. rsync ships with a tool called file-attr-restore, normally found in /usr/share/doc/rsync/support/, which takes the output of find / -ls and restores those file permissions to a filesystem. I create an archive of this information on each system to be backed up by running find / -mount -ls | gzip -9 > /file_list.gz, which makes restoring permissions later straightforward.

Download: rsync_backup

Download: dupe-remove

dupe-remove will run on any filesystem that supports hard links (on Linux at least) and uses a bit of logic to make things happen quite fast compared to other tools. When it first starts, it builds a list of files, recording each file's name, size and inode number. Because that list is keyed by inode number, existing hard links are spotted straight away and don't need to be processed by the later stages.

We then group files by size - logically, two files can only be identical if their sizes match, which cuts the workload down by orders of magnitude. With that list in memory we start hashing the files that share a size (I chose SHA1), looking for exact matches. A simple walk of the final list then does the hard linking and frees up a ton of space.
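
Stripped of the logging, stats and quick-hash pass, the core of that logic looks roughly like this - a minimal sketch only, not the real dupe-remove script, and it assumes everything sits on a single filesystem (hard links cannot cross filesystems):

#!/usr/bin/perl
# Minimal sketch of the dupe-remove idea - the real script adds logging,
# counters, a quick-hash pass and more careful error handling.
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my @backup_paths = ('/backup');    # hypothetical; the real script builds this list itself
my (%seen_inode, %by_size);

# Pass 1: list every regular file with its size and inode. Files whose inode we
# have already seen are existing hard links and need no further processing.
find(sub {
    my ($dev, $ino, $size) = (lstat($_))[0, 1, 7];
    return unless defined $size && -f _;       # regular files only
    return unless $size > 0;                   # nothing to gain from empty files
    return if $seen_inode{"$dev:$ino"}++;      # already hard-linked to a file we have seen
    push @{ $by_size{$size} }, $File::Find::name;
}, @backup_paths);

# Pass 2: only files sharing a size can be duplicates. Hash those groups with
# SHA1 and hard-link every later copy back to the first file with that digest.
for my $size (keys %by_size) {
    my @group = @{ $by_size{$size} };
    next if @group < 2;

    my %keep_for;    # digest => the path we keep
    for my $path (@group) {
        my $digest = eval { Digest::SHA->new(1)->addfile($path)->hexdigest }
            or next;                           # skip unreadable files

        if (my $keep = $keep_for{$digest}) {
            # Link under a temporary name, then rename over the duplicate so the
            # file is never lost if the link step fails part-way.
            my $tmp = "$path.dedupe-tmp";
            if (link($keep, $tmp)) {
                rename($tmp, $path) or do { warn "replace $path failed: $!\n"; unlink $tmp; };
            } else {
                warn "link for $path failed: $!\n";
            }
        } else {
            $keep_for{$digest} = $path;
        }
    }
}

The real script also quick-hashes just the first part of large files before committing to a full hash - more on that in the update notes at the end.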

Example output:

# time ./dupe-remove.pl
Sun Aug  3 16:41:48 2014: Deduplication Process started
Sun Aug  3 16:41:48 2014: Building list of files.....*snip*...........
Sun Aug  3 17:04:48 2014: File list complete!
Sun Aug  3 17:04:48 2014: Filtering file list...
Sun Aug  3 17:04:48 2014: Done!
Sun Aug  3 17:04:48 2014: Performing quick hash............................
Sun Aug  3 17:09:38 2014: Done!
Sun Aug  3 17:09:38 2014: Filtering quick hashes...
Sun Aug  3 17:09:38 2014: Filter complete!
Sun Aug  3 17:09:38 2014: Calculating Full Hashes...
Sun Aug  3 17:10:11 2014: Full Hashes Complete!
Sun Aug  3 17:10:11 2014: Sorting Hashes...
Sun Aug  3 17:10:11 2014: Hash Sort Complete!
Sun Aug  3 17:10:11 2014: Creating hard links....
Sun Aug  3 17:10:16 2014: Hard links Complete!
Sun Aug  3 17:10:16 2014:

DEDUPE SUMMARY:
Sun Aug  3 17:10:16 2014: Number of files checked: 461016
Sun Aug  3 17:10:16 2014: Number of files Quick Hashed: 12737
Sun Aug  3 17:10:16 2014: Number of files Full Hashed: 362
Sun Aug  3 17:10:16 2014: Hardlinks Created: 269
Sun Aug  3 17:10:16 2014: Total Logical bytes: 574,792,681,126
Sun Aug  3 17:10:16 2014: Total Physical bytes: 79,456,840,322
Sun Aug  3 17:10:16 2014: Bytes Saved: 65,868,768

real    28m28.624s
user    0m49.763s
sys     3m12.311s

This turns out to be MUCH faster than any other solution I have found so far. As always, feedback on code is most welcome :)

Updates: 03-Aug-2014: Updated dupe-remove script. Changes include:

  • Do quick-hashes on large files (up to $quick_hash_kb) to trim them from the lists quickly (see the sketch after this list)
  • Remove the shell call to get the date in the logger function
  • Use the perl glob to build the @backup_paths array and check that it actually works
  • Extra stats printed on completion
  • Remove the useless scalar call when filtering the arrays
  • Now logs all actions to $logfile to assist in verifying operation
  • Change the inode hash sorting to hopefully minimise disk seeking
  • Add counters for physical size, logical size and bytes saved by linking
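
The quick-hash mentioned in the first point is just a cheap pre-filter: hash only the first $quick_hash_kb kilobytes of each candidate, and only do a full hash where those short hashes collide. A rough sketch of the idea (not the code from the script itself, and the default size here is made up):

use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my $quick_hash_kb = 1024;    # hypothetical default - the real value is set in the script

# Hash just the first $quick_hash_kb KB of a file; cheap enough to run on large
# files, and two files with different quick hashes can never be duplicates.
sub quick_hash {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return;
    defined(read($fh, my $buf, $quick_hash_kb * 1024)) or return;
    close($fh);
    return sha1_hex($buf);
}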
