Backups suck — a rant

I’m not even mad, I’m just disappointed. I’m tired. Tired of trying to force my cloud-shaped peg through a tape-shaped hole, of custom data formats and convoluted protocols, and of half-assed systems that work well as long as I have just one machine to manage, with just one kind of data on it — unless I want to hack together all the management for different kinds of data myself. The current state of open source backup software is sad. I have even tried looking at commercial solutions, but couldn’t extract any real information about what’s inside the box from the enterprise marketing copy. Are my expectations unrealistic?

I’m writing this post fresh from a single restore of 370 gigabytes of database that took almost a week of wrestling with broken storage, incomplete archives, interrupted transfers, stuck communication, and — most of all — Waiting for Stuff to Complete, for hours and hours. In fact, much of this post has been written during the Waiting. Many of the issues I have wrestled with were caused by mistakes on my side: misconfiguration, insufficient monitoring that should have detected issues earlier, and the fact that over the last month I was not able to pay enough attention to the day-to-day maintenance, which allowed the suckage to accumulate. At the same time, the software systems should automate away the common parts, make it easy to get things right, and be easy to debug when they aren’t. All of the backup systems I’ve ever used or seen fail miserably at two or more of these three points.

But let’s start from the beginning.

Why do we even need backups?

When we hear the word “backup”, we usually imagine a disaster: say, a database server has failed, everything on it has been lost, and we need to get a new one up, as quickly as possible, losing as little data as possible. But this is just one of many cases when backups are helpful.

Actually, for this particular case, backups aren’t even the best tool; online live replication will be quicker to replace the failed piece (just promote the slave to master, and we’re done), and will lose less data (only the replication lag, usually in the range of single seconds). The replication slave can even be used to take some load off the main server by responding to read queries that don’t need perfectly synchronized results, such as analytics and reporting.

This often leads to the conclusion that a backup system isn’t needed after all. Come on, it’s the 21st century, we have online replication with heartbeat checks to automatically promote the slave. Why would we need a bunch of static archives that take up space to store and time to recover? What is this, the 1980s?

But disasters are not the only danger to data, and keeping an archive of historical snapshots is useful in many other cases, including recovery from PEBKAC problems where somebody damaged the data, or from an application bug that sent a DELETE query to the database and had it happily replicated to the slave, in real time. If you have actual backups — a history of static snapshots going back into the past — you won’t be bothered by hearing any of these:

  • Hey, man, that customer just wrote us they can’t see the old comments on their widget listing page. They swear they didn’t click anything (yeah, right), and that the comments were there 17 days ago. Can you bring them back?
  • Our WordPress has just been hacked, and some code has been injected into the PHP files. Can you check when it happened and give me the latest clean version to compare against?
  • I was working on that Accounting Spreadsheet three months ago, and I must have deleted it when cleaning up my desktop. HALP?
  • Can we test the new release on our full production dataset? The database migration transforms every single document, and we hope we got all destructive corner cases, but you know how creative our users are…
  • I’m preparing a report for the investors and need some growth figures – do you have any records of how much customer data we’ve kept over the last three years?

Once you get the backup policy right, it’s a time machine in which nothing of value is truly lost. It is a safety net for the business. Why, then, does the state of backup software today look even worse than monitoring software did in 2011?

21st Century Backup Checklist

What would I expect of a perfect backup system? Besides seeing into the future and storing only the data that I will actually need, instantly available the moment it’s needed, and compressed so that it takes no storage space at all, that is? A decent backup system would be, in no particular order:

  • Using standard formats and tools. I want a clear and simple recovery procedure if all I have is the backup volumes and a rescue boot / recovery CD. Needing an index with encryption keys and content details is still fine, as long as the system generates it for me in a readable format.
  • Encrypted and compressed. You can’t trust your datacenter anymore when you’re in the cloud, spread across five data centers, three continents, three hosting providers, and two storage providers. It’s not your company skyscraper’s reinforced concrete cellar anymore. I want my backups to be transparent to me, and opaque to the storage provider.
  • Centrally managed, but without bottlenecks. There should be a single place where I can check the status of backups across the whole system, drill down through a job’s history, trigger a restore or verification, and so on. On the other hand, some of the heavy lifting should be done on the node’s side; in particular, if I’m using cloud storage, the node should upload the encrypted data directly to the storage, rather than push the same data to the central location, which would then bounce it to the cloud.
  • Application-specific. I want to be able to use my own tools, the ones that are standard enough for me. If I’m backing up MySQL, I prefer xtrabackup to tar.
  • Zero-copy. I don’t want to have to copy the original data to another directory, then tar it up into an on-disk archive, then encrypt the tarball — still on disk — and only then copy it to the storage. This can and should be done online, in a pipeline (see the sketch after this list). We work with terabytes of data nowadays; needing double or triple the local storage just to take a backup is silly.
  • Able to use different storages. I want to be able to use three cheap, unreliable storage providers. I don’t want to be locked into any single silo. In particular, I don’t want to pretend that my backup storage is made of pools of magnetic tapes, and keep a set of intricate scripts pretending that Amazon S3/Glacier, my local disk directory, or a git-annex repository is a tape autochanger.
  • Supporting data rotation. I want to be able to delete old volumes, not just add data and keep it forever. I want to be able to easily specify how the rotation should work, and to change my mind later on.
  • Supporting data reshuffling. It should be possible to move a volume between storages: keep fresh data in the same datacenter, archive it in the cloud at the same time, and put monthly snapshots in deep freeze. If I feel like switching storage providers, adjusting my expiration scheme (and applying it to already created volumes), or just copying data around manually, I should be able to get it done.
  • Secure. No node in the system should have access to another node’s backed-up data. In fact, no node should even access its own historical data without explicit permission. The less the node itself can see or has to say, the better. It is especially bad if the node has direct access to the underlying storage: in case of a break-in, the attacker can not only take down the server, but also delete all its backups (sometimes even other nodes’ backups).
  • Flexible. I want to be able to restore one machine’s backup to another machine. I want to be able to restore just some files. I want to use the backup system to keep a staging database up to date with production data, or to provision load-test infrastructure that mirrors recent production.
  • Scriptable. I want to be able to give my client a “type this one command” or “push that one button” restore instruction, not three paragraphs. This includes restoring a production backup to the staging database and deleting sensitive data from it.
  • Testable. I want a simple way to specify how to verify that I not only have the backups, but will also be able to restore them. In a perfect world, a one-button fire drill that brings up a copy of the production environment and checks that it’s readable. And that button is pushed regularly by the monitoring.
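
To make the zero-copy point concrete, here is a minimal sketch of such a pipeline in Go (a language I’ve been toying with for the proof of concept mentioned later): tar streams the data, gpg encrypts it on the fly, and the encrypted stream is uploaded directly, so nothing ever touches the local disk. The path, the gpg recipient, and the storage URL are placeholders, not anything that actually exists.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/exec"
)

func fail(err error) {
	fmt.Fprintln(os.Stderr, err)
	os.Exit(1)
}

func main() {
	// Dump and compress the data... (placeholder path)
	tar := exec.Command("tar", "-czf", "-", "-C", "/srv/data", ".")
	// ...and encrypt it on the fly (placeholder recipient).
	gpg := exec.Command("gpg", "--batch", "--encrypt", "--recipient", "backup@example.com", "--output", "-")
	tar.Stderr, gpg.Stderr = os.Stderr, os.Stderr

	// Wire tar's stdout straight into gpg's stdin: no temporary files.
	tarOut, err := tar.StdoutPipe()
	if err != nil {
		fail(err)
	}
	gpg.Stdin = tarOut

	// Use gpg's stdout as the HTTP request body.
	body, err := gpg.StdoutPipe()
	if err != nil {
		fail(err)
	}

	if err := tar.Start(); err != nil {
		fail(err)
	}
	if err := gpg.Start(); err != nil {
		fail(err)
	}

	// Stream the encrypted archive to the storage endpoint (placeholder URL).
	req, err := http.NewRequest("PUT", "https://storage.example.com/volumes/data.tar.gz.gpg", body)
	if err != nil {
		fail(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fail(err)
	}
	defer resp.Body.Close()

	if err := tar.Wait(); err != nil {
		fail(err)
	}
	if err := gpg.Wait(); err != nil {
		fail(err)
	}
	fmt.Println("upload finished:", resp.Status)
}
```

Swap tar for xtrabackup (or any other tool that can write to stdout) and the same pipeline gives the application-specific, zero-copy backup described above.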

Where do we stand now?

Some of the currently available systems meet some of the above points.

Bacula is centrally managed, secure, flexible, can compress volumes, and with some gymnastics it can rotate and reshuffle volumes, be application-specific, and do zero-copy backups and restores (though I haven’t managed to do that, or even to see a tutorial or report from anybody who has done it). It fails miserably when it comes to transparency, standard formats, encryption, and storage backends. And I have to pretend to it that all the storage I have is on tapes. It’s awfully clunky and hard to debug when something doesn’t work the way it’s supposed to. It’s also underdocumented.

Duplicity handles one machine and one dataset, but it is good with encryption, standard data formats, different storage backends, and rotation. I think it can also be made application-specific, though zero-copy backups may not be possible. Without central management, security and flexibility are irrelevant.

Some people have success with BackupPC, but it’s file-centric, too many decisions are left to the node being backed up, and it seems to be focused on backing up workstations.

Other systems that I have looked at are either individual low-level pieces of the puzzle, focused on individual machines rather than whole systems, or overcomplicated behemoths from the ’90s that expect me to think in terms of tape archives. I couldn’t get through the marketing copy of the commercial solutions, but I don’t have high hopes for them either.

There is one more project I forgot to mention when writing this rant yesterday: Obnam. It is built on some very good ideas (deduplication, data always visible as full snapshots even when incremental, encryption). However, it is still focused on a single node and on backing up files (I haven’t found any info about application-specific formats and tools), and it uses a custom, opaque storage format (which seems to be the price for deduplication — a necessary design trade-off). Without any special measures, the node can access — and overwrite or delete — its own backup history. If we choose to share a repository between many nodes, each node also has access to all the other nodes’ backups.

Of all these, Bacula is closest to what I imagine to be a good solution: it got the architecture right (a director for the job/volume catalog, scheduling, and overall control; a storage daemon to store the files; a client to receive commands from the director and push data directly to/from storage). On the other hand, its implementation is just unusable. Its communication protocols are custom, opaque, and tied to the particular version; the storage layer could in principle have alternative implementations, but in practice that’s impossible, as there is no complete specification of the protocol, nor any hint as to which parts of it are stable and which can change. The storage, even on disk, is designed in terms of tape archives. There aren’t even hooks for disk volumes that would allow moving the files around in a smarter way (e.g. with git-annex). Its configuration is more fragile and idiosyncratic than Nagios’. The whole system is opaque, and debugging it is a nightmare. It’s not scriptable at all (to make a script that just starts a predefined restore job, I had to use Expect on Bacula’s CLI console). It uses custom storage formats that cannot be easily restored without setting up the whole machinery. It supposedly can do zero-copy, application-specific backups, but that requires quite fragile configuration, and I have never seen a single working sample of it, or a report of anybody having done it. All of these issues are buried deep inside Bacula’s implementation and design. It’s hopeless.

I have also learned that there is a fork of Bacula, named Bareos. I’m not sure whether it’s actually moving towards something more transparent, or just maintaining the hopeless design.

What now, then?

Now, I’m trying to estimate how hard it would be to create a proof-of-concept system based on Bacula’s architecture, but using open protocols, standard tools, and encryption: HTTPS for communication, client SSL certificates for authentication, tar as the storage format, GnuPG for encryption. For an initial implementation, storage could be handed off to S3 (with its automatic Glacier expiration), but behind a façade API that would allow different backends to be used. In the last few days, I’ve been playing around with Go to try and implement a proof-of-concept sketch of such a system. This direction looks promising.
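
To give a flavour of what “open protocols, standard tools” could mean in practice, here is a rough sketch of the client-certificate part in Go. The certificate paths and the director URL are invented for illustration; the point is that authentication is plain TLS with client certificates and the transport is plain HTTP, nothing custom.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// The node's own certificate and private key (placeholder paths).
	cert, err := tls.LoadX509KeyPair("/etc/backup/node.crt", "/etc/backup/node.key")
	if err != nil {
		panic(err)
	}

	// The CA that signed the director's certificate (placeholder path).
	caPEM, err := ioutil.ReadFile("/etc/backup/ca.crt")
	if err != nil {
		panic(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	// A plain HTTPS client that presents the node's certificate
	// and only trusts the backup system's own CA.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      caPool,
			},
		},
	}

	// Ask the director what jobs are pending for this node (hypothetical endpoint).
	resp, err := client.Get("https://director.example.com/v1/jobs/pending")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("director says:", resp.Status)
}
```

The storage side would then be a small façade interface (Put, Get, Delete, List) with one implementation per backend, so that S3/Glacier, a plain directory, or a git-annex repository could be swapped in without the rest of the system caring.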

If you have any remarks, want to add anything, maybe even offer help with the development or design of such a system — or tell me that such a system already exists, which would be great news — feel free to use the comment form below, Twitter, or Hacker News. And if you happen to be in London at Velocity EU, just come find me, or look for a BoF session.

Edits

2013-11-13
Added additional remark about security, mentioned Obnam and Bareos

Don’t name your config Toolfile (or .tool.yml)

There is a pattern, mostly used in the Ruby world, that if you write a command line tool named – say – The Tool, and it has a configuration file tied to a project you use it with, then the tool’s configuration file is called Toolfile and lives in the root of the project directory.

Don’t do that. Please.

I suppose that this started a couple of centuries ago, with the mother of all build tools, Make, and its Makefile. The Ruby build tool Rake reused the existing convention for build tools and named its configuration file Rakefile. This makes sense: in a software project, the build script is the entry point. It’s logical and convenient for it to live in the top-level directory, and it’s easy to locate, because the name is capitalized.

Rake followed a useful convention for build scripts. But then all the other tools followed Rake without stopping to think whether it was useful for them. Now I have a repository that has a Rakefile, Thorfile, Gemfile, Berksfile, Vagrantfile, and Strainerfile – there are more Toolfiles in there than real content.

The other custom is to name the config .tool.yml – also in the project root. While Toolfiles clutter the directory by being too visible, dotfiles do the opposite – they disappear. Some projects get this right and use dotfiles or dot-directories for stuff that should be hidden: .rvmrc, .bundle, or even .chef are good examples. Still, .travis.yml or .rspec are ugly. Even without the dot, cucumber.yml or Jekyll’s _config.yml smell quite bad.

How to name configuration files, then? Simple: use a config directory. Let me put all the config in the config/ subdirectory instead of cluttering the root and making it harder to spot the README file. Seriously, what would be wrong with config/thor.rb, config/gems.rb, config/berkshelf.rb, or config/travis.yml? It would even be possible to adapt to different projects’ conventions by using a dotfile (which is supposed to be hidden) that points to the directory, and looking for the config like this:

  • If .config is an empty file, always use Toolfile and don’t look at any directory
  • If .config is a non-empty file, go with `cat .config`/tool.rb
  • If .config is a directory, or a symlink pointing to one, use `realpath .config`/tool.rb
  • If there is no .config, but config/tool.rb exists, use it
  • At the end, try Toolfile

By default, a tool would look for config/tool.rb or Toolfile, and any project that uses a different config directory could use a .config file or symlink to override this. Vendorificator already partially implements this, and I think I’ll soon release a small gem that deals with config files, including the .config override. Some other options may be useful as well, like finding multiple config file fragments by directory or glob.
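
For the curious, the whole lookup boils down to something like the sketch below. It is written in Go only to keep the rules concrete (the gem itself would obviously be Ruby), and every name in it (the tool name, the paths, the function) is illustrative rather than an existing API.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"strings"
)

// configPath applies the lookup rules from the list above for a tool
// called `tool`; all names here are illustrative.
func configPath(projectRoot, tool string) string {
	dotConfig := filepath.Join(projectRoot, ".config")
	toolfile := filepath.Join(projectRoot, strings.ToUpper(tool[:1])+tool[1:]+"file") // e.g. Toolfile

	if info, err := os.Stat(dotConfig); err == nil {
		if info.IsDir() {
			// .config is a directory, or a symlink pointing to one.
			return filepath.Join(dotConfig, tool+".rb")
		}
		if info.Size() == 0 {
			// An empty .config means: stick with the classic Toolfile.
			return toolfile
		}
		// A non-empty .config names the config directory to use.
		if content, err := ioutil.ReadFile(dotConfig); err == nil {
			return filepath.Join(projectRoot, strings.TrimSpace(string(content)), tool+".rb")
		}
	}

	// No .config: prefer config/tool.rb if it exists...
	if candidate := filepath.Join(projectRoot, "config", tool+".rb"); fileExists(candidate) {
		return candidate
	}
	// ...and fall back to Toolfile at the very end.
	return toolfile
}

func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// E.g. for a tool called "tool" in the current directory:
	fmt.Println(configPath(".", "tool"))
}
```
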
Let’s hope that in a year or two I’ll be able to move at least some of the configuration files away, and won’t need to hunt for my project’s files among the configs!