mdlbear | Of blobs, and hash, and CVS...

...and other geeky things.

Since I've started to use a version control system called git for my recording projects, and since the subject has come up in the comments to my last post, I thought I'd dive a little deeper into git and why I'm using it. (This is a good summary of git's features.)

Unlike CVS and Subversion, but like Mercurial, BitKeeper and SVK, git is a distributed version control system. That means that, instead of having a single centralized repository that represents the authoritative version and all of its history and branches, every working copy includes its own repository. A project can also have shared repositories. Operations like branching, tagging, cloning repositories, and creating and merging patches are blindingly fast, and git scales up easily to huge projects and long histories. This is not surprising, since it was designed for the Linux kernel.

A git repository is just a directory tree full of files; you can access it any way you like: in the filesystem, through an ordinary web server, via rsync, or through its own, somewhat more efficient, protocol. This is a distinct advantage over CVS, which can only use its own protocol, and Subversion, which can use the web but requires you to have access to the server's configuration so that you can set up WebDAV. You can just plop a git repository down on your ISP's server, and it will work.

Now, let's get more specific. Down in its ingenious guts, git is little more than a representation for unique instances of files. A "blob" is a file that has been preceded by a small header, SHA-1 hashed, and compressed (with deflate). Its identifier is the hash, which makes blobs unique and immutable. The commit log is just a chain of similar objects (git calls them "commits"), with each one referring to (the hash of) its predecessor. A directory is represented by another such object (a "tree") that contains an easy-to-parse listing of filenames, mode bits, and object hashes. (There are also "packs", which are files that pull together many objects, for storage efficiency, along with an index for fast access.)
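For the curious, here is a minimal Python sketch of how a blob gets its name. It assumes the usual header layout (the word "blob", a space, the content length in decimal, and a NUL byte); the function names are mine, not git's. The compressed bytes may not match what git writes byte for byte, since that depends on the compression level, but the hash is computed before compression and should agree with `git hash-object`.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID git would assign to a blob with this content."""
    # The header is hashed together with the content, so the ID depends on both.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def git_blob_bytes(content: bytes) -> bytes:
    """Return header + content, deflate-compressed, roughly as stored on disk."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


print(git_blob_id(b"hello world\n"))
# -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Identical content always yields the same ID, and that is the whole trick behind blobs being unique and immutable.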

One of the consequences of this design is that it's particularly easy to write small, specialized programs that operate on git blobs and repositories, and that's really all git is -- a collection of over 100 little programs that somehow manage to add up to a version control system.
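To give a feel for how small such a program can be, here is a rough sketch in Python that unpacks one loose (unpacked) object. The function name `parse_loose_object` is made up for illustration, and the sketch assumes the object is stored as a deflate-compressed header followed by the body; it does not handle packs.

```python
import zlib


def parse_loose_object(raw: bytes):
    """Split a compressed git object into (type, body)."""
    data = zlib.decompress(raw)
    # The header looks like b"blob 12" and ends at the first NUL byte.
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body


# Build a sample object in memory rather than reading a real repository.
sample = zlib.compress(b"blob 6\0hello\n")
print(parse_loose_object(sample))
```

In a real repository a loose object normally lives under `.git/objects/`, in a subdirectory named for the first two hex digits of its hash, in a file named for the rest; you would read that file in binary mode and pass its bytes to the function.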

All of this makes git almost ideal for collaborating on music using Audacity. That's because an Audacity project is a directory full of 1 MB raw audio files tied together by an XML file. The audio files are never modified -- if you want to apply an effect to a track, Audacity makes a modified copy. This means that if you copy an Audacity project to make your own changes, all the hundreds of megabytes of audio that you copy in the process will take up no extra space at all in the git repository. You haven't changed them, so they still have the same hash ID. When you go to check in your changes to a shared repository, you only have to upload a single copy of each new blob. Neat, huh?
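The space saving is easy to see with a toy content-addressed store. This is only an illustration in Python, not how git is implemented: the `ObjectStore` class and the fake "audio blocks" are invented for the example, and a plain dictionary stands in for the repository.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: one entry per distinct content."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        # Same naming scheme as a git blob: hash of header plus content.
        key = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
        self.objects.setdefault(key, content)  # storing a duplicate is a no-op
        return key


store = ObjectStore()
original = [b"audio block %d" % i for i in range(3)]
# The copied project keeps two blocks untouched and gains one processed block.
copy = original[:2] + [b"audio block 2 with reverb"]

ids_original = [store.add(block) for block in original]
ids_copy = [store.add(block) for block in copy]

print(len(store.objects))  # 4 distinct blobs stored, not 6
```

The two untouched blocks hash to the same IDs in both projects, so only the one new block costs anything.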

It's getting lateish, so I'll continue this little dissertation this evening. Happy hacking!