Backups and Filesystem Deduplication

Recently, I’ve been looking at backup solutions to replace TSM. In my opinion, TSM is great for VERY large organisations, but versions beyond TSM v5 seem to be much more bloated than useful in smaller installs. There are a number of backup ‘solutions’ for Linux; however, none seem to maintain a permanent and consistent state without various bits of magic.

I had quite a few goals that I wanted to achieve:
1) Always be able to retrieve a ‘full’ system restore.
2) Only do an incremental backup on a daily basis.
3) Support compression and optionally encryption during file transfer.
4) Be easy to maintain.

I looked at Bacula; however, I ran into a number of issues – mainly that all backups are job based (vs file based), meaning that jobs can expire and copies of files will not exist on the backup system until another backup is run. This can be hacked around by doing VirtualFull backups, but that is rather nasty imho.

I ended up writing a script that uses the standard rsync tool to sync remote systems to a local storage area. I also keep hardlinked copies of each system that are rotated nightly.

This means I have $hostname/0, which is the latest FULL copy of the system, and $hostname/1-7, which contain the last 7 days of backups. The advantage of using hard links is that if a file is unchanged, it is linked in place – meaning no extra space is used (7 copies of the same file take the same space as 1 copy!). The end result is that I can back up multiple systems over multiple days in very little disk space. I coupled this with forced compression on a BTRFS filesystem – which saves ~15% extra space.
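
Roughly, the nightly cycle works like this. The sketch below is illustrative only – the paths, options and hostname are placeholders rather than a copy of the rsync_backup script linked below:

#!/usr/bin/perl
# Illustrative sketch of the nightly rotate-and-sync cycle.
use strict;
use warnings;
use File::Path qw(remove_tree);

my $host        = 'host1.example.com';   # placeholder hostname
my $backup_root = "/backup/$host";       # placeholder storage layout

# Drop the oldest snapshot and shift the rest down: 6->7, 5->6, ..., 0->1
remove_tree("$backup_root/7") if -d "$backup_root/7";
for my $n ( reverse 0 .. 6 ) {
    rename "$backup_root/$n", "$backup_root/" . ( $n + 1 )
        if -d "$backup_root/$n";
}

# Pull a fresh copy into .../0, hard linking unchanged files against
# yesterday's snapshot so they consume no extra space.
system( 'rsync', '-aHz', '--delete', '--numeric-ids',
        "--link-dest=$backup_root/1",
        "root\@$host:/", "$backup_root/0/" ) == 0
    or die "rsync failed: $?\n";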

Now for the big challenge. The majority of systems are RHEL (well, Scientific Linux to be exact) – which means MANY duplicate files: kernel files, libraries, etc. that we can’t automatically hard link in place. While BTRFS promises a native deduplication process, it seems to be VERY slow (although when it does work, it should be nice). I tried quite a few different dedupe tools, but nothing really worked well for me… so I wrote my own.

You can restrict rsync access to certain ssh keys by using the rrsync script – usually distributed with the rsync package.
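
For example, the key that the backup server uses to log in to a client could be locked down in that client’s ~root/.ssh/authorized_keys with something along these lines (the rrsync path shown assumes the RHEL rsync-3.0.9 package; adjust to suit your distribution). The -ro flag restricts the key to read-only rsync access below the given directory:

command="/usr/share/doc/rsync-3.0.9/support/rrsync -ro /",no-agent-forwarding,no-port-forwarding,no-pty,no-X11-forwarding ssh-rsa AAAA...key... backup@backupserver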

Download: rsync_backup
Download: dupe-remove

dupe-remove will run on any filesystem that supports hard links (on Linux at least) and uses a bit of logic to make things happen quite fast compared to other tools. When it first starts, it builds a list of files, grabbing each filename, size and inode number. Because the file list is first indexed by inode number, existing hard links collapse into a single entry and don’t need to be processed by the logic further on.
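
A rough sketch of that first pass (illustrative only – the structure and most names here are placeholders, not the actual dupe-remove code):

use strict;
use warnings;
use File::Find;

my @backup_paths = glob '/backup/*/0';   # placeholder paths
my %by_inode;    # "dev:inode" => { size => ..., files => [ paths ] }

find( sub {
    my @st = lstat $_ or return;
    return unless -f _;                  # regular files only
    my ( $dev, $ino, $size ) = @st[ 0, 1, 7 ];
    # Indexing by inode means existing hard links collapse into one entry,
    # so they never reach the later hashing stages.
    push @{ $by_inode{"$dev:$ino"}{files} }, $File::Find::name;
    $by_inode{"$dev:$ino"}{size} = $size;
}, @backup_paths );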

We then look at file sizes – logically, two files can only be identical if their sizes match, which rapidly cuts the workload down by orders of magnitude. With the remaining list in memory, files sharing a size are hashed (I chose SHA1) to find exact matches. A simple walk of the final list then does the hard linking and frees up a ton of space.
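
Continuing the sketch above, the size filter, hashing and linking stages could look roughly like this (again illustrative, not the actual dupe-remove code):

use Digest::SHA;

# Group one representative path per inode by file size; a size seen only
# once cannot have a duplicate, so those files are never read at all.
my %by_size;
for my $rec ( values %by_inode ) {
    push @{ $by_size{ $rec->{size} } }, $rec->{files}[0];
}

for my $size ( keys %by_size ) {
    my @candidates = @{ $by_size{$size} };
    next if @candidates < 2;

    # Only files whose size collides with another file's get hashed.
    my %by_hash;
    for my $file (@candidates) {
        my $digest = eval { Digest::SHA->new(1)->addfile($file)->hexdigest }
            or next;    # skip files that vanished or cannot be read
        push @{ $by_hash{$digest} }, $file;
    }

    # Everything sharing a SHA1 is relinked to the first copy found.
    for my $dupes ( values %by_hash ) {
        my ( $keep, @replace ) = @$dupes;
        for my $victim (@replace) {
            my $tmp = "$victim.dedupe-tmp";
            if ( link $keep, $tmp ) {
                rename $tmp, $victim or warn "rename $victim failed: $!\n";
            }
            else {
                warn "link for $victim failed: $!\n";
            }
        }
    }
}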

Example output:

# time ./dupe-remove.pl
Sun Aug  3 16:41:48 2014: Deduplication Process started
Sun Aug  3 16:41:48 2014: Building list of files.....*snip*...........
Sun Aug  3 17:04:48 2014: File list complete!
Sun Aug  3 17:04:48 2014: Filtering file list...
Sun Aug  3 17:04:48 2014: Done!
Sun Aug  3 17:04:48 2014: Performing quick hash............................
Sun Aug  3 17:09:38 2014: Done!
Sun Aug  3 17:09:38 2014: Filtering quick hashes...
Sun Aug  3 17:09:38 2014: Filter complete!
Sun Aug  3 17:09:38 2014: Calculating Full Hashes...
Sun Aug  3 17:10:11 2014: Full Hashes Complete!
Sun Aug  3 17:10:11 2014: Sorting Hashes...
Sun Aug  3 17:10:11 2014: Hash Sort Complete!
Sun Aug  3 17:10:11 2014: Creating hard links....
Sun Aug  3 17:10:16 2014: Hard links Complete!
Sun Aug  3 17:10:16 2014:

        DEDUPE SUMMARY:
Sun Aug  3 17:10:16 2014: Number of files checked: 461016
Sun Aug  3 17:10:16 2014: Number of files Quick Hashed: 12737
Sun Aug  3 17:10:16 2014: Number of files Full Hashed: 362
Sun Aug  3 17:10:16 2014: Hardlinks Created: 269
Sun Aug  3 17:10:16 2014: Total Logical bytes: 574,792,681,126
Sun Aug  3 17:10:16 2014: Total Physical bytes: 79,456,840,322
Sun Aug  3 17:10:16 2014: Bytes Saved: 65,868,768

real    28m28.624s
user    0m49.763s
sys     3m12.311s

This turns out to be MUCH faster than any other solution I have found so far.

As always, feedback on code is most welcome 🙂

Updates:
03-Aug-2014: Updated dupe-remove script. Changes include:

  • Do quick-hashes on large files (up to $quick_hash_kb) to trim large files from our lists quickly (see the sketch after this list)
  • Remove the shell call to get the date in the logger function
  • Use the perl glob to build the @backup_paths array and check that it actually works
  • Extra stats printed on completion
  • Remove the useless scalar function in filtering the arrays
  • Now logs all actions to $logfile to assist in verifying operation
  • Change the inode hash sorting to hopefully minimise disk seeking
  • Add counters for physical size, logical size and bytes saved by linking
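
The quick-hash mentioned above simply hashes the first chunk of a large file before committing to reading the whole thing; two files can only be identical if their quick hashes match, so large files that differ early never need a full-file hash. Something along these lines (illustrative, with a placeholder threshold):

use Digest::SHA;

my $quick_hash_kb = 1024;    # placeholder: hash only the first 1 MiB

sub quick_hash {
    my ($file) = @_;
    open my $fh, '<:raw', $file or return;
    read $fh, my $buf, $quick_hash_kb * 1024;
    close $fh;
    return unless defined $buf;
    return Digest::SHA->new(1)->add($buf)->hexdigest;
}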

  8 Responses to “Backups and Filesystem Deduplication”

  1. Backups shouldn’t be 0-7, they should be 1968-12-26-$host-id so we can add a comment that on 12-27 a monolith on the moon wiped out our systems.

    And it should be host id, not function id… let’s say we replaced our WWW server but there was a bit of overlap – who gets backed up in the timeframe of the change?

  2. Looking into your solution it looks quite a bit like dirvish. Dirvish uses rsync and hardlinks. I have been very happy with dirvish for many years.

    Also your dupe-remove looks quite a lot like faster-dupemerge. Perhaps you can find some ideas in that code and merge the two.

    Thanks for your blog post.

  3. This is really fast. Notice, though, that the current dupe-remove will override a file’s owner, group and modification timestamp with those of another file that has a distinct owner, group or timestamp but shares the same SHA-1.

    • Correct.

      I use the following code to back up permissions / ownership data on each machine:
      find / -mount -ls | xz -z -9 > /file_list.xz

      This file will then get backed up with the rsync backup from remote systems to the backup machine.

      The RHEL rsync packages come with a file-attr-restore script that will take the output of this file and restore file permissions to the target filesystem in the case of a restore. It lives in /usr/share/doc/rsync-3.0.9/support.

  4. Okay. Still, this is not enough. Consider a filesystem where 2 files have the same SHA-1 but (i) distinct inodes and (ii) distinct owners. Your script will hard link those files to the same $linksource. This is undesirable.

    Because, first, ownership and permission rights, even when restored with the file-attr-restore script, will necessarily have to favor one of those 2 users. As a result, one user will end up seeing another user’s ownership and permissions.

    Second, because both users will end up sharing a single inode. Then in a restored filesystem, if user A modifies a shared inode, then changes will reflect on user B’s side as well.

    As to ignoring timestamps, rsync will get confused if a path gets a different timestamp from the file it becomes linked to.

    (As Per Marker Mortensen has pointed out above, helpful discussion of this kind of issue can be found at dupemerge.)

    In a nutshell, the criteria for $linksource should be same UID && same GID && same timestamp && same permissions, rather than just having the greatest nlinks.

    • You misunderstand the situation… the backup of permissions is done on the host before the rsync to the backup system is started. As such, the permissions / ownership etc. match.

      I suggest looking at both the script included in the rsync directory and at the output of the file created by the command above.

      If you do need to restore a backup, you will need to follow the steps with the rsync script to ensure things are restored properly.

  5. 3 more suggestions:

    * You need to decode find’s output for Unicode characters, otherwise files with non-ASCII filenames will get you into trouble. $_ = Encode::decode ( "UTF-8", $_ ) if /\P{ASCII}/;

    * You may also filter out files with zero bytes from the first file list, as those files have the same SHA-1 sum, but they do not necessarily mean the same.

    * You should probably filter out revision control repositories (e.g., the .git and *.git directories) from the file list as well.

    • I haven’t tested this with unicode, but it should still work. Testing welcome.

      You can filter out a minimum file size via the @args array for find. Really, spending the time / disk IO on files below 64k doesn’t seem worth it to me – but others’ use cases may be different. The code has this example.

      You can probably filter out a few wildcards using the @args array too; you can supply any valid argument that find would normally accept.
