mdlbear | Of blobs, and hash, and CVS...

...and other geeky things.

Since I've started to use a version control system called git for my recording projects, and since the subject has come up in the comments to my last post, I thought I'd dive a little deeper into git and why I'm using it. (This is a good summary of git's features.)

Unlike CVS and Subversion, but like Mercurial, BitKeeper and SVK, git is a distributed version control system. That means that, instead of having a single centralized repository that represents the authoritative version and all of its history and branches, every working copy includes its own repository. A project can also have shared repositories. Operations like branching, tagging, cloning repositories, and creating and merging patches are blindingly fast, and git scales up easily to huge projects and long histories. This is not surprising, since it was designed for the Linux kernel.

A git repository is just a directory tree full of files; you can access it any way you like: in the filesystem, through an ordinary web server, via rsync, or through its own, somewhat more efficient, protocol. This is a distinct advantage over CVS, which for remote access needs CVS itself installed on the server, and Subversion, which can use the web but requires you to have access to the server's configuration so that you can set up WebDAV. You can just plop a git repository down on your ISP's server (after running git update-server-info, so that web clients can find its branches), and it will work.

Now, let's get more specific. Down in its ingenious guts, git is little more than a representation for unique instances of files. A "blob" is a file that has been preceded by a small header, SHA-1 hashed, and compressed (with deflate). Its identifier is the hash of the header plus the contents, which makes blobs unique and immutable. (Strictly speaking, git calls all of these things "objects" and reserves the word "blob" for the ones that hold file contents, but the other kinds are built exactly the same way.) The commit log is just a chain of commit objects, with each one referring to (the hash of) its predecessor. A directory is represented by a "tree" object that contains an easy-to-parse listing of filenames, mode bits, and object hashes. (There are also "packs", which are files that pull together many objects for storage efficiency, along with an index for fast access.)
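To make the header-hash-compress recipe concrete, here is a minimal Python sketch of how an object ID can be computed and where a loose (unpacked) object would live on disk. The function names are my own inventions for illustration and are not part of git; the header layout (type, a space, the length in decimal, a NUL byte) and the two-character directory split under .git/objects are git's.

```python
import hashlib
import zlib
from typing import Tuple


def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 hex ID for an object of the given kind ("blob", "tree", ...)."""
    # The header is "<kind> <length in decimal>" followed by a NUL byte.
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def encode_loose_object(kind: str, body: bytes) -> Tuple[str, bytes]:
    """Return (object ID, deflate-compressed bytes) for header + body."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    store = header + body
    # The hash is taken over the uncompressed bytes; compression comes afterwards.
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def loose_object_path(object_id: str) -> str:
    """Loose objects are filed under the first two hex digits of their ID."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid, packed = encode_loose_object("blob", b"hello world\n")
    print(oid)
    print(loose_object_path(oid))
    print(len(packed), "compressed bytes")
```

The ID printed for a given file should match what `git hash-object` reports for the same contents. Because the ID depends only on the type, length, and bytes, two files with identical contents always end up as the same object.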

One of the consequences of this design is that it's particularly easy to write small, specialized programs that operate on git blobs and repositories, and that's really all git is -- a collection of over 100 little programs that somehow manage to add up to a version control system.
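As an illustration of how small such a program can be, here is a sketch in Python that writes and reads the body of a directory listing. It assumes the layout git uses for each entry (the mode in octal text, a space, the filename, a NUL byte, then the 20 raw bytes of the hash). The function names and the example filenames are hypothetical, and the sketch skips the name-sorting that git itself applies before hashing a listing.

```python
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, filename, 40-character hex hash)


def build_tree_body(entries: List[Entry]) -> bytes:
    """Serialize entries as "<mode> <name>\\0" followed by the 20-byte raw hash."""
    body = b""
    for mode, name, hex_id in entries:
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(hex_id)
    return body


def parse_tree_body(body: bytes) -> List[Entry]:
    """Walk the body entry by entry and recover (mode, name, hex hash) tuples."""
    entries: List[Entry] = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)    # end of the mode field
        nul = body.index(b"\0", space)   # end of the filename
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex digits
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    listing = [
        ("100644", "project.aup", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
        ("40000", "project_data", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"),
    ]
    for mode, name, hex_id in parse_tree_body(build_tree_body(listing)):
        print(mode, hex_id, name)
```

The hashes in the listing are only placeholders to exercise the round trip; in a real repository the second entry would point at another directory listing rather than at a file.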

All of this makes git almost ideal for collaborating on music using Audacity. That's because an Audacity project is a directory full of 1MB raw audio files tied together by an XML file. The audio files are never modified -- if you want to apply an effect to a track, Audacity makes a modified copy. This means that if you copy an Audacity project to make your own changes, all the hundreds of megabytes of audio that you copy in the process will take up no extra space at all in the git repository. You haven't changed them, so they still have the same hash ID. When you go to check in your changes on a shared repository, you only have to upload a single copy of each new blob. Neat, huh?
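Here is a toy Python model of that space saving, under the assumption that storage is keyed purely by the hash of each file's contents. The BlobStore class, the tiny 1 KB "audio" blocks, and the filenames are all made up for illustration; this is a sketch of the idea, not how git is actually implemented.

```python
import hashlib
from typing import Dict


class BlobStore:
    """A toy content-addressed store: one stored copy per distinct content."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def add(self, content: bytes) -> str:
        header = f"blob {len(content)}\0".encode("ascii")
        object_id = hashlib.sha1(header + content).hexdigest()
        # Content that is already present is not stored a second time.
        self.objects.setdefault(object_id, content)
        return object_id

    def stored_bytes(self) -> int:
        return sum(len(content) for content in self.objects.values())


store = BlobStore()

# The original project: two unmodified audio blocks (hypothetical names).
original = {"e0000001.au": b"\x01" * 1024, "e0000002.au": b"\x02" * 1024}
for data in original.values():
    store.add(data)
before = store.stored_bytes()

# A collaborator's copy: the same two blocks plus one new, effect-processed block.
copy = dict(original)
copy["e0000003.au"] = b"\x03" * 1024
for data in copy.values():
    store.add(data)
after = store.stored_bytes()

print("extra bytes stored for the copy:", after - before)
```

Only the new block adds to the total; the two unchanged blocks hash to IDs the store already holds.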

It's getting lateish, so I'll continue this little dissertation this evening. Happy hacking!


From: aerowolf.livejournal.com

There's a very important thing for VCSs for multimedia, that I'm not seeing git (or really anything) as having: Arbitrary blobs of metadata, associated with repository blobs, which can be accessed separately from the blobs that they refer to (i.e., I should be able to access any blob of metadata without having to check out the entire file).

In light of your earlier post, one of the larger issues is this: Yes, you have 1MB raw audio files tied together by an XML file. These audio files are at your sampling rate and recording quality. Why shouldn't there be a means of reducing the quality (and thus the size) for a "quick preview" to be downloaded separately from the whole recording?

For that matter, why shouldn't bitmaps at an arbitrary resolution of the waveforms and the FFTs of the waveforms be generated? Why shouldn't individual sections be addressable separately without having to download the entire high-quality file?

Atop this, where's the ability to assign credits (in the filk world, Music, Lyrics, Performer, Producer, Permissions [think Creative Commons] and derivative work information, among other things) to the blobs themselves? How about ways to purge specific blobs (not only metadata, but also of large data that doesn't need to be maintained any longer)? How about migration of blobs to tape or some other backup/archival medium?

Design this, and you'll have a much better tool for multimedia collaboration. (I'm already thinking of ways to do this, but I need to get a LOT more experience with VCS design before I can code even a proof-of-concept.)


mdlbear

The system I'm working on at work has many of those features; it's not done yet. My present purpose is not to create the ultimate MM collaboration system, but to use existing cross-platform tools to get a small number of tasks accomplished sometime in the next two months.

In fact, all the necessary metadata exists elsewhere in the filesystem, and is mostly up online. Audacity is an audio editor -- it would serve no purpose to attach metadata to every blob, to automagically generate compressed previews, and all the other stuff on your wishlist. The idea, again, is to work with the widest possible range of existing tools using the thinnest possible layer of additional metadata.

The Mandelbear's Musings
