Of blobs, and hash, and CVS...
2006-10-10 09:26 am
...and other geeky things.
Since I've started to use a version control system called git for my
recording projects, and since the subject has come up in the comments to
my last post, I thought I'd dive a little deeper into git and why I'm
using it. (This is a good summary of git's features.)
Unlike CVS and Subversion, but like Mercurial, BitKeeper and SVK, git is a
distributed version control system. That means that, instead of having a
single centralized repository that represents the authoritative version
and all of its history and branches, every working copy includes its own
repository. A project can also have shared repositories. Operations like
branching, tagging, cloning repositories, and creating and merging patches
are blindingly fast, and git scales up easily to huge projects and long
histories. This is not surprising, since it was designed for the Linux
kernel.
A git repository is just a directory tree full of files; you can access it
any way you like: in the filesystem, through an ordinary web server, via
rsync, or through its own, somewhat more efficient, protocol. This is a
distinct advantage over CVS, which can only use its own protocol, and
Subversion, which can use the web but requires you to have access to the
server's configuration so that you can set up WebDAV. You can just plop a
git repository down on your ISP's server, and it will work.
Now, let's get more specific. Down in its ingenious guts, git is little
more than a representation for unique instances of files. A "blob" is a
file that has been preceded with a small header, SHA-1 hashed, and
compressed (with deflate). Its identifier is the hash, which makes blobs
unique and immutable. The commit log is just a chain of commit objects,
stored the same way, with each one referring to (the hash of) its
predecessor. A directory is represented by a "tree" object that contains
an easy-to-parse listing of filenames, mode bits, and object hashes.
(There are also "packs", which are files that pull together many objects,
for storage efficiency, along with an index for fast access.)
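To make the header-hash-compress recipe concrete, here is a minimal Python sketch of how a blob's identifier can be computed. The function name `hash_object` is my own choice for illustration, and the sketch only covers the loose, one-file-per-object form, not packs.

```python
import hashlib
import zlib


def hash_object(data: bytes, obj_type: str = "blob"):
    """Return (hex id, deflated bytes) for one object.

    The header is the object type, a space, the content length in
    decimal, and a NUL byte. The SHA-1 is taken over header + content
    before compression, so the id does not depend on the zlib level.
    """
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    store = header + data
    object_id = hashlib.sha1(store).hexdigest()
    return object_id, zlib.compress(store)


# The id of an empty file: the same in every repository.
empty_id, empty_packed = hash_object(b"")
print(empty_id)
```

Because the id depends only on the bytes, two people who check in the same file independently end up with the same identifier for it.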
One of the consequences of this design is that it's particularly easy to
write small, specialized programs that operate on git blobs and
repositories, and that's really all git is -- a collection of over 100
little programs that somehow manage to add up to a version control system.
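As an illustration of how small such a program can be, here is a sketch of a reader for a single loose object, written in Python. The function name `read_loose_object` is hypothetical; the sketch assumes the object is stored unpacked, in a file named after its hash (the first two hex digits as a subdirectory, the remaining 38 as the filename), and it does not handle packs.

```python
import os
import zlib


def read_loose_object(objects_dir: str, object_id: str):
    """Return (type, content) for one loose object.

    objects_dir is the repository's objects directory; object_id is
    the full 40-character hex hash.
    """
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())

    # The header ends at the first NUL byte; the rest is the content.
    header, _, body = store.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body
```

Pointed at the `objects` directory inside a repository's `.git`, something like `read_loose_object(".git/objects", some_id)` would return the object's type and raw bytes, provided that object has not been moved into a pack.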
All of this makes git almost ideal for collaborating on music using
Audacity. That's because an Audacity project is a directory full of 1MB
raw audio files tied together by an XML file. The audio files are never
modified -- if you want to apply an effect to a track, Audacity makes a
modified copy. This means that if you copy an Audacity project to make
your own changes, all the hundreds of megabytes of audio that you copy in
the process will take up no extra space at all in the git repository. You
haven't changed them, so they still have the same hash ID. When you go to
check in your changes on a shared repository, you only have to upload a
single copy of each new blob. Neat, huh?
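Here is a toy Python model of why the unchanged audio costs nothing. It is a sketch of the idea rather than of git's actual storage code: `ObjectStore`, `check_in` and the file names are all made up for illustration, and the "audio" is just a kilobyte of filler per file.

```python
import hashlib


class ObjectStore:
    """A dictionary keyed by content hash, standing in for a repository."""

    def __init__(self):
        self.objects = {}

    def add(self, data: bytes) -> str:
        store = b"blob %d\x00" % len(data) + data
        object_id = hashlib.sha1(store).hexdigest()
        # Identical content hashes to the same id, so it is kept once.
        self.objects.setdefault(object_id, data)
        return object_id


def check_in(store: ObjectStore, files: dict) -> dict:
    """Add every file; return a filename -> id listing (a stand-in for a directory)."""
    return {name: store.add(data) for name, data in files.items()}


# Hypothetical project: two audio blocks plus the XML project file.
original = {
    "b0001.au": b"\x01" * 1024,
    "b0002.au": b"\x02" * 1024,
    "project.aup": b"<project v1/>",
}
store = ObjectStore()
check_in(store, original)
before = len(store.objects)

# A collaborator copies the project, adds one effect-processed block,
# and the XML file changes to point at it.
edited = dict(original)
edited["b0003.au"] = b"\x03" * 1024
edited["project.aup"] = b"<project v2/>"
check_in(store, edited)
new_objects = len(store.objects) - before
print(before, new_objects)
```

Only the new audio block and the changed XML file add objects; the two untouched blocks are found by hash and skipped.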
It's getting lateish, so I'll continue this little dissertation this evening. Happy hacking!
no subject
Date: 2006-10-10 05:28 pm (UTC)
In light of your earlier post, one of the larger issues is this: Yes, you have 1MB raw audio files tied together by an XML file. These audio files are at your sampling rate and recording quality. Why shouldn't there be a means of reducing the quality (and thus the size) for a "quick preview" to be downloaded separately from the whole recording?
For that matter, why shouldn't bitmaps at an arbitrary resolution of the waveforms and the FFTs of the waveforms be generated? Why shouldn't individual sections be addressable separately without having to download the entire high-quality file?
Atop this, where's the ability to assign credits (in the filk world, Music, Lyrics, Performer, Producer, Permissions [think Creative Commons] and derivative work information, among other things) to the blobs themselves? How about ways to purge specific blobs (not only metadata, but also large data that doesn't need to be maintained any longer)? How about migration of blobs to tape or some other backup/archival medium?
Design this, and you'll have a much better tool for multimedia collaboration. (I'm already thinking of ways to do this, but I need to get a LOT more experience with VCS design before I can code even a proof-of-concept.)
no subject
Date: 2006-10-10 06:21 pm (UTC)
In fact, all the necessary metadata exists elsewhere in the filesystem, and is mostly up online. Audacity is an audio editor -- it would serve no purpose to attach metadata to every blob, to automagically generate compressed previews, and all the other stuff on your wishlist. The idea, again, is to work with the widest possible range of existing tools using the thinnest possible layer of additional metadata.