Whether you are a web host, an IT professional or a hobbyist, if you have one or more headless Linux servers, backups are one of your main concerns. And incremental backups are your friends.
Amazon S3 (http://www.amazon.com/aws) offers a very interesting storage solution, especially since they lowered their already low rates.
There is even a tool for it called ‘s3sync’ that works a bit like rsync.
Unfortunately it only works ‘a bit’ like rsync; it does not check if a remote file with the same timestamp and/or checksum exists, and it ends up using up much more bandwidth than you originally planned for.
This could be avoided if S3 supported rsync; unfortunately it does not.
An ideal solution would be a local rsync, which would be smart about copying files.
No problem! This can be achieved by mounting an S3 bucket as a local disk; here is how.
In this tutorial, I am going to refer to your headless web server as your ‘Linux box’. So, if your local box also runs Linux, please bear with me: I am not talking about this box!
Here we go:
Download the USB key version of Jungledisk from http://www.jungledisk.com/download.shtml and run it on your local machine. It does not matter if you’re running Windows, OS X or Linux. All three systems are supported. Jungledisk will create a configuration file called ‘jungledisk-settings.ini’.
If you need to modify this file, run junglediskmonitor and modify your settings.
Download the Linux version of Jungledisk, expand the archive and copy ‘jungledisk’ to its own directory on your Linux box.
Copy your local jungledisk-settings.ini to your Linux box in the same directory as ‘jungledisk’. You do not need to copy and run ‘junglediskmonitor’ on that box, since all it would do is modify the .ini file.
Everything from this point on is going to happen on your Linux box.
If you do not have Fuse, download it from http://sourceforge.net/project/showfiles.php?group_id=121684&package_id=132802
Expand the package, go to the new Fuse directory.
Type:
./configure --enable-lib
make && make install
If you forget --enable-lib, the dynamic library will not be created.
Alternatively you could use Coda instead of Fuse but it tends to come with your OS and you may end up using a wrong version of Coda with Davfs. And I like the flexibility of Fuse. Research it a little bit: installing Fuse buys you much more than what is covered in this tutorial.
Make and install libneon. You may already have a neon package installed. If so, you can try to proceed without downloading a newer version. If you systematically experience ‘mount’ - cf below - complaining about ‘invalid URL’, I recommend upgrading libneon.
Get it at http://www.webdav.org/neon/
Expand, go to the new directory, type:
./configure
make && make install
Make and install davfs. Get it at http://dav.sourceforge.net/
Actually, you may find a binary that perfectly suits your box and not even need to build it. Otherwise, type:
./configure
make && make install
Let’s create a user/group for davfs:
groupadd davfs2
useradd davfs2 -g davfs2
You now have all the pieces required. Let’s get crankin’:
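Before going further, it may be worth confirming that the pieces really are on your PATH. This is a minimal sketch: the helper name is made up for illustration, and the command names in the usage example are assumptions about what your builds installed (mount.davfs is the name that shows up in the comments below), so adjust them to your box.

```shell
#!/bin/sh
# Print the name of every command that is NOT found on the PATH.
# Returns 0 if all of them are present, 1 otherwise.
missing_cmds() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return $rc
}

# Example (command names are assumptions, adjust to your install):
# missing_cmds rsync mount.davfs links
```

If it prints nothing, everything in the list was found.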
Run ‘jungledisk’:
./jungledisk
You are likely to have a text-mode browser, either ‘lynx’ or ‘links’. Let’s pick one; in my case, ‘links’:
links http://localhost:2667
If it displays an XML header rather than rejecting your connection, then jungledisk is running correctly.
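If you would rather script this check than eyeball it in a browser, something along these lines should do. It is only a sketch: it assumes curl is installed and that the response begins with a standard `<?xml` declaration, which may not hold for every Jungledisk version.

```shell
#!/bin/sh
# Succeeds if the first line read from stdin starts with an XML declaration.
looks_like_xml() {
    # Keep going even if the input has no trailing newline (or is empty).
    IFS= read -r first_line || true
    case "$first_line" in
        '<?xml'*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example (assumes curl is available):
# curl -s http://localhost:2667 | looks_like_xml && echo "jungledisk is up"
```

A refused connection produces no output at all, so the check fails in that case too.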
Time to mount your disk. Here is how:
mkdir /mnt/J # (or any other directory you wish to use)
mount http://localhost:2667 /mnt/J --nolock -t davfs
The --nolock may seem a bit superstitious but in my experience mounting a remote FS without it can lead to catastrophic crashes.
You now have an unlimited disk drive mounted at /mnt/J/
If all this works correctly, you can now start thinking of mounting your disk at boot-up time.
First, you need to start ‘jungledisk’ every time. Either create a start-up script in /etc/init.d and use chkconfig or, if you wish to keep things simple, add a call to jungledisk at the end of your /etc/rc.d/rc.local. I still have to check that the latter works, though, as I believe that Linux will try to mount your disk before calling rc.local.
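One wrinkle with doing this at boot is that jungledisk may need a moment before it accepts connections, so a mount issued right after it can fail. A small retry helper is one way around that. This is a sketch, not a tested init script: the install path /opt/jungledisk, the use of curl, and the retry numbers in the usage example are all assumptions.

```shell
#!/bin/sh
# Retry a command until it succeeds.
# usage: wait_until MAX_TRIES DELAY_SECONDS command [args...]
# Returns 0 as soon as the command succeeds, 1 if it never does.
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    return 1
}

# Example for rc.local (path and numbers are assumptions):
# cd /opt/jungledisk && ./jungledisk &
# wait_until 30 1 curl -s -o /dev/null http://localhost:2667 \
#     && mount http://localhost:2667 /mnt/J --nolock -t davfs
```

The mount command in the example is the same one used by hand earlier; only the waiting is new.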
And finally, let’s make sure that we have an entry in /etc/fstab:
http://localhost:2667 /mnt/J davfs noauto,user 0 0
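If you script your server setup, you probably want to avoid adding that line twice. Here is a minimal sketch of an idempotent append. The helper name is made up for illustration, and it is worth trying it on a copy of the file rather than on /etc/fstab itself.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already there.
# usage: add_line_once FILE LINE
add_line_once() {
    file=$1
    line=$2
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! grep -Fxq -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (try it on a copy first):
# cp /etc/fstab /tmp/fstab.test
# add_line_once /tmp/fstab.test 'http://localhost:2667 /mnt/J davfs noauto,user 0 0'
```

Running it a second time with the same arguments leaves the file unchanged.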
That’s all!
Oh, wait. Right. What about backups?
As I wrote earlier, rsync remains your best bet. Here is how to run it:
mkdir -p /mnt/J/home/1
rsync -aHx --numeric-ids --no-whole-file --size-only /home/ /mnt/J/home/1/
This example shows how to perform smart backups of your /home directory; the reason why I am pushing this backup to a directory called ‘1’ is that I am performing real incremental backups with automated backup rotation.
If you have questions about it, I will write another tutorial where I explain how to use hard links to rotate backups without wasting disk space. But in the meantime, do not let that ‘1’ scare you.
So, what’s this then?
--no-whole-file is extremely important: rsync thinks that you are copying your files to a local disk and decides to favour local speed as opposed to saving on bandwidth usage. This can be very bad. This parameter forces it to be smarter than that.
-a tells rsync to use an ‘archive’ mode where all important file information is preserved
-H allows us to preserve hard links, which are not covered by ‘-a’
-x is important as well: you will not end up creating backups of whatever remote devices you have mounted locally
--size-only (new: suggested by, I guess, Eric Johnson (mtexte)) is important for the reasons given in his comment below: a WebDAV server does not preserve file times, so without this flag rsync backs up the same unchanged files on every run
Eric also suggests using ‘--inplace’ but I am not so sure about that yet.
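One thing worth guarding against if you run the backup unattended: when the mount has failed, /mnt/J is just an empty local directory and rsync will happily fill your local disk instead. Here is a sketch of a check to run first. It reads /proc/mounts by default; the function name and the optional second argument are assumptions made for illustration.

```shell
#!/bin/sh
# Succeeds if DIR appears as a mount point in the mounts table.
# usage: is_mounted DIR [MOUNTS_FILE]   (MOUNTS_FILE defaults to /proc/mounts)
is_mounted() {
    dir=${1%/}                      # drop a trailing slash, if any
    mounts=${2:-/proc/mounts}
    # The second field of each line is the mount point.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# Example:
# if is_mounted /mnt/J; then
#     rsync -aHx --numeric-ids --no-whole-file --size-only /home/ /mnt/J/home/1/
# else
#     echo "/mnt/J is not mounted, skipping backup" >&2
# fi
```

The rsync line in the example is the one from above, unchanged.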
OK, *now* that’s all (for now!)
I haven’t used JungleDisk in a year or so, but does it still make a random S3 bucket like “b56c0253fa575e423b5e…” and create weird file names? When I wrote my other post I had actually considered pretty much this exact route to save on storage/bw but I avoided it for my first reason, if it’s still applicable, and so I could have “snapshots” and go back to see certain versions of files. Aforementioned other post: http://paulstamatiou.com/2007/07/29/how-to-bulletproof-server-backups-with-amazon-s3/
Paul,
Yes, it still creates “interesting” bucket names. It postfixes them with the name you were hoping to create, but still!
If I want to recover a previous version of a file, I suppose that I could do it using a trick that involves hard links. However, I have to wonder how Jungledisk handles links…at best, I fear that it would cost me in pure S3 ‘put’ requests.
Hello,
As far as I understand, there should be a lot of I/O bandwidth between the local rsync process and the “local storage” (S3 in this case). Most of that bandwidth goes to computing block signatures of the candidate file; the signatures are then compared with those from the remote rsync. That means that even when you use --no-whole-file you will end up using a lot of costly traffic to the S3 storage.
On the other hand, I know that jungledisk caches a lot in memory. Did you check the bandwidth statistics to S3 after running rsync on an already synced directory?
Regards,
Addady
Thanks for the helpful post, Chris!
I wonder, though, about your rsync options. The folks at JungleDisk suggest using both the --inplace and --size-only flags, like this:
rsync -r --inplace --size-only /home /mnt/s3
These settings compensate for the fact that S3 (like any WebDAV server) doesn’t preserve access times, and without them rsync will back up the same unchanged files on every run.
Does --no-whole-file even work going to S3, which doesn’t support incremental writes? As Addady says, we need stats for this.
How-to set up JungleDisk on Linux — without XWindows:
http://el-studio.com/article/jungledisk-linux-backups
For your Ubuntu Dapper (6.10) servers, no building from source — unless you want to.
Good comments!
Yes, I had toyed with the idea of using size-only but was missing the crucial piece of information that you just posted about WebDAV servers. I am going to update my original post.
Addady, you are correct about the cache: it is a big piece of my backup strategy.
[…] Cheap Server Backup with Amazon S3 (tags: amazon AWS backup linux rsync s3 jungledisk fuse sysadmin) […]
This is a nice post– thanks. But I still get the “invalid URL” error when I try to mount davfs.
+ mkdir /mnt/jd
+ mount.davfs http://localhost:2667 /mnt/jd -o nolocks
mount.davfs: invalid URL
I’ve tried several versions of neon, and currently have the latest version (neon-0.27.2). What neon version are other folks running?
This idea of mounting my S3 bucket as a local disk drive and then using rsync will not work well.
It will not give you any bandwidth-efficient algorithm and will upload the whole new file, not just what was changed.
Only when you connect to an rsync service on the remote side (for example http://www.s3rsync.com, on EC2 servers at Amazon) can you fully benefit from rsync’s power.
Rsync to a “local” drive is a bad idea since it uploads the whole file; as mentioned above, the backup process is much slower.
I’ve modified this to better suit CPanel based sites with sql support at http://duivesteyn.net/2008/amazon-s3-backup-for-webserver-public_html-sql-bash/
hope it helps someone