
Holistic Engineering


03 Mar 2012
rsync: the swiss army chainsaw of backup utilities

Update: after writing this, Phil Hollenback told me about rsnapshot, which looks to be a better solution for most use cases. Additionally, I have added Chef recipes that implement some of the things seen in this article.

Time Machine is a pretty neat little thing, but it’s not the mother of all backup utilities; that title belongs to rsync. This article goes into automating your backup process across Unix derivatives with the versatile tool, and some orchestration to feed remote backups to a home network that exists on a possibly-dynamic IP address.

If you’re here, you’re likely familiar with both tools and more or less what they do. However, before I go into detail, some history is required…

So, the late great Steve Jobs was of the mind that Thunderbolt, the I/O bus, should be the new standard for high-performance external storage. While that’s fine and dandy and all, and largely remains to be seen, tell that to my USB3- and SATA-capable home RAID enclosure I like to store backups on. At some point I flipped out at seeing Time Machine take days to finish a backup when I knew just copying files over the network (or even faster methods; we’ll talk about that below) would be orders of magnitude faster, and I don’t even have anything fancy network-wise at home. To add insult to injury, anyone who’s built a hackintosh and actually tried that eSATA or USB3 port… knows it works. :)

I had two real options. I could drop another $1200+ on one of these (I actually have one of these at work and they are wonderful; they’re just not realistic or cost-effective for what I want to do at home), or I could get old school and go back to rsync, since Time Machine isn’t really how I restore machines anyhow; usually by the time I’m ready to rebuild a machine I want to eliminate the cruft, not restore it. rsync gives me the best of both worlds: a full machine copy AND the ease of pulling individual files or trees out when I need them.

While we’re here singing the praises of rsync, let’s sing the praises of hard links as well. rsync + hard links is a battle-tested backup method used on sites with a lot of real data and gives you incremental backups with a minimal amount of work.

So let’s start with the basics:

What’s a hard link?

The reason I’m asking is that this is a surprisingly oft-missed question in interviews, and a frequent source of confusion among co-workers. Comprehending it properly means you need a deeper understanding of filesystems than the high-level file/directory view.

Let’s get some basic axioms down before we go into discussion:

  • Hard links (provided by ln with no options) are not symbolic links (ln -s).
  • Hard links cannot span filesystems. Symbolic links can.
  • Hard links cannot reference directories. Symbolic links can.
  • Symbolic links are separate files on disk. Hard links are not.

So, fundamentally, a filesystem is a hash table of references, or pointers if you want to be more precise. The key is the name of the file and the value is a pointer to the head of the data. In the case of a directory, the key is the directory’s name and the value is… another hash table.

(For you advanced readers: yes, I am glossing over a lot of shit; I only have so many characters to share with the world in a narcissistic attempt to show how awesome I am.)

To use hash syntax from perl and/or ruby, it looks a bit like this:

/ => {
    file1 => 0xDEADBEEF,
    file2 => 0xDEADFACE,
    dir1  => {
        file3 => 0xCAFEBEAD,
        file4 => 0xBEADCAFE
    }
}

What a hard link does is create another entry, in a directory of your choosing, that references the same pointer, and therefore the same data.

To use language garbage-collection terms: the crux of the issue here is that every file starts out as a hard link with a reference count of one. Creating another hard link with ln increases the reference count. Files with a reference count of zero that are not held open by any process are reclaimed as free storage.

These pointers are called inodes, and you’ll see them with tools like df -i and mkfs. They have a minimum size and lots of other important properties and data that I’m glossing over. You can read about them here.

So, doing this:

ln /file1 /file3

Creates this:

/ => {
    file1 => 0xDEADBEEF,
    file2 => 0xDEADFACE,
    file3 => 0xDEADBEEF, # right here, guys
    dir1  => {
        file3 => 0xCAFEBEAD,
        file4 => 0xBEADCAFE
    }
}
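
If you want to see this on a live system, ls -li will show it: the first column is the inode number, and the number after the permissions is the link count. Something like this (the inode number, owner, and size are made up for illustration):

ls -li /file1 /file3
52433 -rw-r--r-- 2 erikh erikh 4096 Mar  3 12:00 /file1
52433 -rw-r--r-- 2 erikh erikh 4096 Mar  3 12:00 /file3

Both names point at inode 52433, and the link count column reads 2.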

Easy, right? Let’s get to why I’m telling you about all this.

The magic of rsync’s --link-dest

First off, let’s RTFM. From man rsync:

--link-dest=DIR         hardlink to files in DIR when unchanged

This means: if you specify a directory here, it will be consulted for comparison purposes. If a file is unchanged relative to its copy in that directory, it is hard linked into the target directory (specified at the end of your command line) instead of being transferred again.
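
Stripped of the excludes and sudo plumbing you’ll see in the script below, the core invocation looks something like this, using the hosts and dates from my setup (treat it as a sketch, not a drop-in command):

rsync -a --link-dest=/backups/speyside/2012-03-02 / erikh@chef10:/backups/speyside/2012-03-03

Anything that hasn’t changed since the 2012-03-02 run gets hard linked into the new directory on the receiving end; only new or changed files are actually transferred.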

Here is some orchestration that exploits this property:

#!/usr/bin/env perl

use strict;
use warnings;

use DateTime;
use DateTime::Duration;
use DateTime::Format::Strptime;

our $HOSTNAME    = `hostname -s`;
chomp $HOSTNAME;
our $BACKUP_DIR  = "/backups/$HOSTNAME";
our $BACKUP_USER = "erikh";

my $today = DateTime->now;
my $yesterday = $today - DateTime::Duration->new(days => 1);
my $host = $ARGV[0];

my $yesterday_ymd = $yesterday->ymd;
my $today_ymd = $today->ymd;

my $user_string = "$BACKUP_USER\@$host:";

if ($host eq 'localhost')
{
    $user_string = "";
}

# Exclude the backup area and virtual filesystems, and hard link anything
# unchanged against yesterday's snapshot. --link-dest is resolved on the
# receiving side, so give it the full path to yesterday's directory.
system("sudo rsync -a --numeric-ids --exclude /backups --exclude /dev --exclude /proc --exclude /sys --rsync-path='sudo rsync' --link-dest=$BACKUP_DIR/$yesterday_ymd / $user_string$BACKUP_DIR/$today_ymd");

This is a small backup tool I wrote which exploits --link-dest. Backup directories are organized by date, and the previous date’s directory is used for --link-dest when performing the new rsync. Files that haven’t changed are hard linked in, reducing transfer time and, as a side effect, the space used. On the local network (1 Gbps), this takes about 4 minutes for a daily backup of my main Mac workstation.
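
Assuming you save it as backup.pl somewhere sensible, a nightly crontab entry on each client along these lines keeps the snapshots coming (the schedule and path are whatever suits you, and the user running it needs passwordless sudo for rsync):

0 3 * * * /usr/local/bin/backup.pl chef10.local

The backup server itself just runs the same thing with localhost as the argument.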

Directories look like this:

[2] erikh@chef10 /backups% ls
./  ../  chef10/  coffee/  extra/  lost+found/  speyside/

And inside one of them:

[4] erikh@chef10 /backups/speyside% ls
./  ../  2012-02-25/  2012-02-26/  2012-02-27/  2012-02-28/  2012-02-29/  2012-03-01/  2012-03-02/  2012-03-03/

Each directory holds a full backup, but only the files that changed take up new space. Each machine stores roughly 250GB, save the server, which has around 30GB. The mount point is using 1.2TB, and the extra directory is the exception, taking around 800GB of that. Not bad for a week’s worth of backups across three machines.
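
Because du only counts a hard-linked file once per invocation, it’s easy to see how much unique data a given day actually added; something like this (the numbers are invented for illustration):

du -sh 2012-03-02 2012-03-03
232G    2012-03-02
1.8G    2012-03-03

The newer directory looks like a full backup to ls, but only the changed files consume new space.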

Over a month, it’ll be even more amazing. Speaking of which, here’s a small script which prunes the backups older than 30 days (space saving or not, you’ll eventually run out of space if you leave these unattended):

#!/usr/bin/env perl

use strict;
use warnings;

use DateTime;
use DateTime::Duration;
use DateTime::Format::Strptime;

use constant BACKUP_DIR => "/backups";
use constant DAYS => 30;

my $today       = DateTime->now;
my $last_month  = $today - DateTime::Duration->new(days => DAYS);
my $dt_format   = DateTime::Format::Strptime->new(pattern => "%Y-%m-%d");

opendir(HOST_DIRS, BACKUP_DIR) or die "cannot open " . BACKUP_DIR . ": $!";

for my $host_dir (readdir(HOST_DIRS)) {

    my $full_host_dir = BACKUP_DIR . "/$host_dir";
    next unless -d $full_host_dir;
    next if $host_dir =~ /^\./;
    next if $host_dir eq 'lost+found';

    opendir(DIR, $full_host_dir);

    for my $dir (readdir(DIR)) {
        my $full_dir = "$full_host_dir/$dir";

        next unless -d $full_dir;
        next if $dir =~ /^\./;

        my $date = $dt_format->parse_datetime($dir);
        next unless $date; # skip anything that isn't a YYYY-MM-DD directory

        system("rm -rf $full_dir") if ($date < $last_month);
    }

}

It just walks each host’s tree and prunes any directories dated more than 30 days ago.
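
Like the backups themselves, this is best left to cron; a daily entry on the backup server does it, probably from root’s crontab since the snapshots preserve ownership. For example (I’m calling the script prune_backups.pl here; the path and schedule are whatever you like):

30 5 * * * /usr/local/bin/prune_backups.pl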

Off-site Backups

Now that we have on-site backups, what happens when we have machines that are outside of the firewall’s reach that we want to bring to the same backup repository?

The great thing about rsync is that it can be used in conjunction with ssh, allowing for lots of tunneling options and increased security. However, we’re going to do something simpler: open a port.
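
If the port you end up opening isn’t 22, the least invasive fix is an ssh config stanza on the remote machine rather than touching the backup script at all; something like this in ~/.ssh/config, where 2200 is just an example port:

Host home.hollensbe.org
    Port 2200

rsync picks that up automatically, since it’s just running over ssh.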

Airport Configuration

So now the port is forwarded, but we still need to be able to find out where we’re sending the data. We’re on a dynamic IP here at home, so some accounting for that is also required; we can’t just take ifconfig’s values, since we’re behind a NAT, and so on and so forth.

So the route here is effectively:

coffee.hollensbe.org -> home.hollensbe.org (NAT) -> chef10.local

Let me introduce you to my little friend: jsonip. A really straightforward, just-give-me-an-IP solution that doesn’t require scraping or any other business. I decided to implement this myself since, you know, the internet is fleeting and lots of things tend to disappear. Sinatra is a great tool for such small services. Here’s the code:

require 'rubygems'
require 'sinatra'
require 'yajl'

get '/' do
  headers["Content-Type"] = "application/json"
  return Yajl.dump({:ip => request.ip })
end

This runs at jsonip.hollensbe.org. If you’d like an example of how to set this up with unicorn, look here.
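
Checking it is about as simple as it gets (the address you get back is whatever your network’s external IP happens to be; 203.0.113.42 is a documentation address):

curl http://jsonip.hollensbe.org/
{"ip":"203.0.113.42"}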

Now we have a way to get the external IP address of our network, but we need a way to communicate any IP changes to our remote machines.

Enter BIND. BIND is the swiss-army-chainsaw of DNS servers, and if you don’t know it, you should. BIND supports dynamic updates in a variety of ways, but we’re going to use the nsupdate strategy here, since it’s the simplest.

nsupdate uses a simple pre-shared key approach: updates are signed with an HMAC of the shared secret, so the secret itself never crosses the wire (similar in spirit to the way your /etc/passwd and /etc/shadow files store hashes rather than passwords). The solution here config-wise is pretty simple:

key "hollensbe.org" {
    algorithm hmac-md5;
    secret "my-secret";
};

zone "hollensbe.org" {
    type master;
    file "/etc/bind/domains/hollensbe.org";
    allow-update { key "hollensbe.org"; };
};

What this allows us to do is send a message to our BIND server that says ‘let me publish updates to the zone file’. There are a few caveats to this, mainly revolving around rndc freeze and rndc thaw, so I suggest reading the manual before continuing with this approach.
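
As for where my-secret comes from: the stock BIND tooling will generate one for you. Something along these lines (the key size and the generated file names will vary):

dnssec-keygen -a HMAC-MD5 -b 128 -n HOST hollensbe.org

That drops a Khollensbe.org.+157+NNNNN.key/.private pair in the current directory; the base64 blob on the Key: line of the .private file is what goes in the secret field above (and in the YAML file the script below reads).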

Here is a small script that I wrote that:

  • Retrieves credentials from a YAML file in /etc.
  • Pulls the IP from jsonip.hollensbe.org.
  • Resolves DNS for home.hollensbe.org.
  • If they differ, it sends an nsupdate to the BIND server with the jsonip address.

#!/usr/bin/env ruby

require 'yaml'
require 'resolv'
require 'json'
require 'open-uri'

HOSTNAME = "home.hollensbe.org"
JSONIP_SERVICE = "http://jsonip.hollensbe.org"

DNS_INFO = YAML.load_file(ENV["TEST"] ? "dns_info.yaml" : "/etc/dns_info.yaml")

ip = JSON.load(open(JSONIP_SERVICE).read)["ip"] rescue nil
resolved_ip = Resolv.getaddress(HOSTNAME) rescue nil

# only push an update if we actually got an external IP and it differs from what DNS says
if ip and ip != resolved_ip
  puts "Updating #{HOSTNAME} to be ip '#{ip}' (previously '#{resolved_ip}')"

  IO.popen("nsupdate -y #{DNS_INFO["key"]}:#{DNS_INFO["secret"]} -v", 'r+') do |f|
    f << <<-EOF
      server #{DNS_INFO["server"]}
      zone #{DNS_INFO["key"]}
      update delete #{HOSTNAME} A
      update add #{HOSTNAME} 60 A #{ip}
      show
      send
    EOF

    f.close_write
    puts f.read
  end
end

Now our backup.pl can point at home.hollensbe.org over SSH and perform backups normally, without having to worry about IP changes.
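
To tie it together: the DNS updater runs from cron somewhere inside the home network, and the remote machine just crons backup.pl against the dynamic hostname. For example (schedules and paths are whatever you like; update_dns.rb is just what I’m calling the Ruby script above):

# at home, e.g. on chef10
*/5 * * * * /usr/local/bin/update_dns.rb

# on the remote machine, e.g. coffee
0 4 * * * /usr/local/bin/backup.pl home.hollensbe.org

The 60 second TTL in the nsupdate above means an address change gets picked up almost immediately.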

That’s all I’ve got! I hope you found this informative.


Til next time,
Erik Hollensbe at 14:18
