Of blobs, and hash, and CVS...
2006-10-10 09:26 am
...and other geeky things.
Since I've started to use a version control system called git for my recording projects,
and since the subject has come up in the comments to my last post, I
thought I'd dive a little deeper into git and why I'm using
it. (This is a good summary of
git's features.)
Unlike CVS and Subversion, but like Mercurial, BitKeeper and SVK,
git is a distributed version control system. That
means that, instead of having a single centralized repository that
represents the authoritative version and all of its history and branches,
every working copy includes its own repository. A project can
also have shared repositories. Operations like
branching, tagging, cloning repositories, and creating and merging patches
are blindingly fast, and git scales up easily to huge
projects and long histories. This is not surprising, since it was
designed for the Linux kernel.
A git repository is just a directory tree full of files; you
can access it any way you like: in the filesystem, through an ordinary web
server, via rsync, or through its own, somewhat more
efficient, protocol. This is a distinct advantage over CVS, which can
only use its own protocol, and Subversion, which can use the web but
requires you to have access to the server's configuration so that you can
set up WebDAV. You can just plop a git repository down on
your ISP's server, and it will work.
Now, let's get more specific. Down in its ingenious guts,
git is little more than a representation for unique instances
of files. A "blob" is a file that has been preceded by a small header,
SHA-1 hashed, and compressed (with deflate). Its identifier is the hash,
which makes blobs unique and immutable. The commit log is just a chain of
objects stored the same way, with each commit referring to (the hash of) its
predecessor. A directory is represented by a similar object (a "tree") that
contains an easy-to-parse listing of filenames, mode bits, and blob hashes.
(There are also "packs", which are files that pull together many objects, for
storage efficiency, along with an index for fast access.)
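To make that concrete, here's a minimal Python sketch of the scheme as I understand it: header, SHA-1, deflate. The function names and the author identity are made up for illustration. One detail the description above glosses over is that the header names a type (blob, tree or commit), so directory listings and commits aren't literally labelled "blob" on disk.

```python
import hashlib
import zlib


def encode_object(kind: str, body: bytes) -> bytes:
    """Prefix the body with git's small header: b'<kind> <size>\\0'."""
    return f"{kind} {len(body)}".encode("ascii") + b"\x00" + body


def object_id(kind: str, body: bytes) -> str:
    """The identifier is the SHA-1 of header + body, as 40 hex digits."""
    return hashlib.sha1(encode_object(kind, body)).hexdigest()


def tree_entry(mode: str, name: str, hex_id: str) -> bytes:
    """One directory-listing entry: mode, space, name, NUL, 20 raw hash bytes."""
    return f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)


# A file's contents become a blob.
blob_id = object_id("blob", b"hello world\n")

# A directory listing that points at that blob by hash.
tree_body = tree_entry("100644", "hello.txt", blob_id)
tree_id = object_id("tree", tree_body)

# Two commits; the second names the first as its parent.
who = "A Hacker <hacker@example.com> 1160472360 +0000"  # made-up identity
first_body = f"tree {tree_id}\nauthor {who}\ncommitter {who}\n\nfirst\n".encode()
first_id = object_id("commit", first_body)
second_body = (
    f"tree {tree_id}\nparent {first_id}\nauthor {who}\ncommitter {who}\n\nsecond\n"
).encode()
second_id = object_id("commit", second_body)

# What lands on disk is the deflate-compressed header + body.
stored = zlib.compress(encode_object("blob", b"hello world\n"))
```

Because the parent's hash is part of the second commit's body, changing anything in the first commit would change the ID of every commit after it. That's where the immutability comes from.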
One of the consequences of this design is that it's particularly easy to
write small, specialized programs that operate on git blobs
and repositories, and that's really all git is -- a
collection of over 100 little programs that somehow manage to add up to a
version control system.
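Here's roughly what one of those little programs might look like, as a sketch in Python rather than the C and shell that git itself uses. It writes and reads a single unpacked ("loose") object, assuming the usual layout of a two-hex-digit subdirectory followed by the remaining 38 digits. The function names are mine, a scratch directory stands in for the real objects directory, and it knows nothing about packs.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def write_loose(objects_dir: str, kind: str, body: bytes) -> str:
    """Store one object as <objects_dir>/<first 2 hex digits>/<remaining 38>."""
    raw = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    hex_id = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, hex_id[:2], hex_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return hex_id


def read_loose(objects_dir: str, hex_id: str) -> Tuple[str, bytes]:
    """Inflate one loose object and split the header from the body."""
    path = os.path.join(objects_dir, hex_id[:2], hex_id[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\x00")
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


# Round trip in a scratch directory standing in for the objects directory.
scratch = tempfile.mkdtemp()
demo_id = write_loose(scratch, "blob", b"hello world\n")
demo_kind, demo_body = read_loose(scratch, demo_id)
```

Nothing here needs a server or a daemon; it's all plain file I/O, which is why serving a repository from an ordinary web server works.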
All of this makes git almost ideal for collaborating on music
using Audacity. That's
because an Audacity project is a directory full of 1MB raw audio files
tied together by an XML file. The audio files are never modified -- if
you want to apply an effect to a track, Audacity makes a modified copy. This
means that if you copy an Audacity project to make your own changes, all
the hundreds of megabytes of audio that you copy in the process will
take up no extra space at all in the git repository.
You haven't changed them, so they still have the same hash ID. When you
go to check in your changes on a shared repository, you only have to
upload a single copy of each new blob. Neat, huh?
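A toy model shows why. This is only a sketch: `ObjectStore` is a made-up class, not anything in git, and a few short byte strings stand in for the 1MB audio files. The idea is just that a store keyed by content hash can't hold the same content twice.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: one entry per distinct content."""

    def __init__(self):
        self.objects = {}  # hex ID -> content

    def add(self, content: bytes) -> str:
        """Hash the content as a blob and keep it only if it is new."""
        raw = b"blob %d\x00" % len(content) + content
        hex_id = hashlib.sha1(raw).hexdigest()
        self.objects.setdefault(hex_id, content)
        return hex_id

    def missing_from(self, other: "ObjectStore") -> set:
        """IDs we hold that the other store lacks: what would need uploading."""
        return set(self.objects) - set(other.objects)


# Stand-ins for the audio files of an existing project.
original_blocks = [b"block-a", b"block-b", b"block-c"]

shared = ObjectStore()
for block in original_blocks:
    shared.add(block)

mine = ObjectStore()
for block in original_blocks:  # the project as I received it
    mine.add(block)
for block in original_blocks:  # my copy of the project: same bytes, same IDs
    mine.add(block)
new_id = mine.add(b"block-b-with-reverb")  # an effect writes one new file

to_upload = mine.missing_from(shared)
```

After all that, `mine` holds four objects rather than seven, and `to_upload` contains only the ID of the one new file.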
It's getting lateish, so I'll continue this little dissertation this evening. Happy hacking!