Backups and Filesystem Deduplication


  • Rewritten in Jun 2020 for a more modern approach.
  • Previous rewrite was in August 2014.

When I originally wrote this guide, there were a lot of pieces missing, and I had to fill in the gaps myself. Six years later, a lot has changed. I moved from IBM's Tivoli Storage Manager to my own hash-based file deduplication script - which at the time was the least resource-intensive method I could find - and now to this setup.

For the sake of history, and maybe as a learning exercise, I've preserved the original dupe-remove script.

This update includes:

  • Using BTRFS snapshots to speed things up
  • BTRFS deduplication using bees
  • Transparent compression of backups

The goals for my backup system have always been:

  1. Always be able to retrieve a 'full' system restore.
  2. Only do an incremental backup on a daily basis.
  3. Support compression and optionally encryption during file transfer.
  4. Be easy to maintain.

Step 1 - Prepare BTRFS filesystem

In my case, I've created a second partition on /dev/sda that is the backup volume. The advantage here is that you can add compression across that entire volume in /etc/fstab and not have to worry any further.

To enable compression, make your backup volume line in /etc/fstab look like this:

UUID=1fd7b2de-dca6-4fa1-a1d0-a87e83e6a492 /backups      btrfs   defaults,compress=zstd:1    1 1

Remount the drive for this to take effect: mount -o remount /backups

This will enable zstd level 1 compression, which compresses well while staying fast - a noticeably better ratio than lzo at a comparable speed. If you want to trade off speed for extra compression, you can increase the level, but you'll get a smaller gain with each step.

Note: If you have a spinning drive but use it in a VM, you may have to tell BTRFS that you're not using an SSD - a detected rotational speed of 0 will cause BTRFS to assume it's an SSD. To fix this, add ,nossd after defaults in /etc/fstab.
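For example, with the nossd option added, the fstab entry from above becomes:

UUID=1fd7b2de-dca6-4fa1-a1d0-a87e83e6a492 /backups      btrfs   defaults,nossd,compress=zstd:1    1 1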

Step 2 - Set up the clients

You'll need a working SSH key setup already in place for this. On each client, we want to restrict the IP address the key can be used from, and also use an rsync helper called rrsync to restrict what our backup server can do - i.e. read-only access.

Copy this to /root/bin/remote-commands - and we'll use this to wrap our backup session:

#!/bin/bash

# Forced-command wrapper: log every connection, then only allow the
# specific commands the backup server is expected to run.
SCRIPTNAME=$(basename "$0")
REMOTE_ADDR=$(echo "$SSH_CLIENT" | awk '{print $1;}')

logger "$SCRIPTNAME: SSH from $REMOTE_ADDR, Command: $SSH_ORIGINAL_COMMAND"

if [ -z "$SSH_ORIGINAL_COMMAND" ]; then
    echo "Interactive login disabled."
    exit 1;
fi

case "$SSH_ORIGINAL_COMMAND" in
    "rsync --server --sender"*)
        if [ ! -f /usr/share/doc/rsync/support/rrsync ]; then
            echo "rrsync is not installed."
            exit 1
        fi
        if [ ! -x /usr/share/doc/rsync/support/rrsync ]; then
            chmod +x /usr/share/doc/rsync/support/rrsync
        fi
        /usr/share/doc/rsync/support/rrsync -ro /
        ;;

    "create_file_list")
        find / -mount -ls | pigz -9 > /file_list.gz
        ;;

    *)
        echo "Permission denied, please try again."
        exit 1
        ;;
esac

This script does two things: it wraps rrsync, and it creates a list of all files on the system along with their ownership and permissions. That list can be used with the rsync support tool file-attr-restore, or referenced manually if required.

In /root/.ssh/authorized_keys, restrict your SSH key so that it can only run this script, and add your backup server's IP address as additional security:

from="<backup server IP>",command="$HOME/bin/remote-commands",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA......

Step 3 - Set up the backup server

Download the rsync_backup script and put it in /root/bin. Make sure you chmod +x /root/bin/rsync_backup.

This script works mostly by magic. Create a directory in /backups/ named after the hostname or IP address you want to back up, then create the files described below. By default, the script keeps 7 days of backups; you can change this by modifying the maxdays variable at the top of rsync_backup.
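For example, adding a new client (the hostname is purely illustrative) is as simple as:

# mkdir /backups/client.example.com
# touch /backups/client.example.com/excludes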

If, for whatever reason, you can't use BTRFS, each btrfs subvolume command in the script has a commented-out shell command that was used prior to the move to BTRFS. You can switch these around, but you obviously won't get the deduplication or compression features from this guide either.

In /backups/$host/excludes - a list of files or directories to exclude from the backups. A good starter template is:

/dev
/mnt
/proc
/run
/sys
/tmp
/var/cache
/lost+found

If you need to specify options only to be used with this host, you can create /backups/$host/options - for example to enable transfer compression:

RSYNC_OPTIONS="-z"
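Assuming rsync_backup sources this file and passes RSYNC_OPTIONS straight through to rsync, any per-host flags can go in here - for example, compression plus a bandwidth cap (in KiB/s) for a host on a slow link:

RSYNC_OPTIONS="-z --bwlimit=5000"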

Note: Manually connect to each client at least once from the backup server, just to accept the SSH host keys - otherwise the automated connection will fail.

We now use systemd timers to start the backup service each day. There's a catch, though: systemd really doesn't like long-running processes like this, so we work around it by handing the backup off to the at scheduler from within a wrapper. Ugly, but it's the only way I found to get around this systemd issue.

Create /root/bin/run_daily_backup and enter the following:

#!/bin/bash
echo /root/bin/rsync_backup | at -m now
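Make the wrapper executable, and make sure the at daemon is actually installed and running - on Fedora that should just be:

# chmod +x /root/bin/run_daily_backup
# dnf install at
# systemctl enable atd --now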

Now let's create the systemd service in /etc/systemd/system/run_daily_backup.service:

[Unit]
Description=Run daily backup

[Service]
Type=oneshot
ExecStart=/root/bin/run_daily_backup

[Install]
WantedBy=multi-user.target

And now the timer in /etc/systemd/system/run_daily_backup.timer:

[Unit]
Description=Run the daily backup

[Timer]
OnCalendar=*-*-* 02:00:00

[Install]
WantedBy=timers.target

This will run the backup every day at 2am local time. Adjust the schedule as you like.
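As an example of tweaking it, the following [Timer] section should fire at 3am on Saturdays only, with up to 15 minutes of random delay so it doesn't collide with other scheduled jobs:

[Timer]
OnCalendar=Sat *-*-* 03:00:00
RandomizedDelaySec=15min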

When ready, reload systemd and activate the timer:

# systemctl daemon-reload
# systemctl enable run_daily_backup.timer --now

Step 4 - Setting up Deduplication

I currently run Fedora 34, so there's a nice prebuilt package for bees available in Copr.

Enable the repo via dnf copr enable szpadel/bees, then install via dnf install bees.

We now need to find the UUID of our partition used to store backups - in my case, that's /dev/sda2:

# blkid /dev/sda2
/dev/sda2: LABEL="BACKUPS" UUID="1fd7b2de-dca6-4fa1-a1d0-a87e83e6a492" UUID_SUB="394e3793-1739-4b77-ba96-14dbdb21f2d7" BLOCK_SIZE="4096" TYPE="btrfs" PARTUUID="7bfeb870-02"

Grab the UUID, then enable bees for that partition:

# systemctl enable bees@1fd7b2de-dca6-4fa1-a1d0-a87e83e6a492 --now

This should start deduplication as soon as the command returns. If you already have data on this drive, it'll take quite a while to work through the backlog - but from then on, each write to the filesystem will trigger bees to do its magic.
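If you want to keep an eye on what it's doing, bees logs to the journal under the same instance name, so following it should just be:

# journalctl -fu bees@1fd7b2de-dca6-4fa1-a1d0-a87e83e6a492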

If all your clients are running the same OS, you'll likely save a ton of space by enabling bees.
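If you're curious how much is actually being shared once bees has been through everything, btrfs can break usage down into exclusive versus shared data:

# btrfs filesystem du -s /backups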

Step 5 - Sit back and relax

This should be everything you need to do. If you've set up root's email to be forwarded to you, you should get an email report at the end of each backup run with stats on what happened - as well as any problems.

Bees will sit in the background and do its thing transparently. Compression will just happen automatically.

I've been running this setup for well over a year now, and the biggest problem I seem to come across is a connectivity issue when a host is offline for whatever reason.

You'll always have the latest snapshot in /backups/$host/0 - which is a complete replica of the filesystem of the remote client.
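That also makes single-file restores trivial - just copy the file back out of the snapshot on the backup server (the hostname and path here are only an example):

# cp -a /backups/client.example.com/0/etc/fstab /root/fstab-restored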
