the corner office

a blog, by Colin Pretorius

Files and document management

I'm devoting a few cycles to the issue of file system storage and representation. I know that very bright people have devoted lots of thought and research to this and it is exactly this sort of problem that fairy-dust operating systems like Longhorn are meant to help solve, but the little problem domain I'm dealing with right now is filing and categorising all my "reference" files: the saved web pages, web sites, ebooks, PDFs, Word documents, text documents, quickly typed notes and miscellaneous files that comprise my library of electronic knowledge. (As an aside, all my files are stored in directories named after places from Middle Earth, and my reference folder is called Gondor - the scene with Gandalf digging through the old scrolls in the cinematic version of The Fellowship Of The Ring perfectly captured why I'd always chosen 'Gondor' for the reference stuff.)

My current 'system', such as it is, is fairly simple: categorised directories, subcategories, and subcategories again, like a great big tree. The problem is that it forces a single category on a document or set of documents. What's more, some directories imply categorisation while others are just the internal structure of a single 'block' of documents. For example, an HTML version of an electronic book usually has one or more 'images' or 'data' directories. Which directories have semantic purpose and which are purely internal document structure? On top of that, directory names like Teach_Yourself_DATABASE_PROGRAMMING_WITH_VB5_in_21_Days_2nd_Ed are not very handy, and vb_21_2 doesn't really do the job either. Somewhere in between is a what-to-name-it minefield that causes me more stress than the issue deserves, but I can't help myself.

So while my system isn't unworkable, I'd very much like to overcome its limitations. There are a number of really great things I'd like to be able to do and they all revolve around the issue of categorisation and annotation. I'd like to be able to plop a bit of work about Java programming in Notes into both a Notes category and a Java category. If it's related to web development, then I'd like to put it in a WebDev category too. I also want to be able to note that it's a document generated by me, and I want to be able to find other such documents with minimal effort. I want to be able to generate a list of ebooks, websites, etc. Of the many electronic books I have, it would be nice to store the usual book details, such as publishers, year published, editions, etc. I'd like a history of when things were checked in. I want to be able to add comments and descriptions and cross-references and notes as I work with documents.
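To make the wish list a little more concrete, here is a minimal Python sketch of what a metadata record with multiple categories might look like. Everything in it is an assumption made for illustration: the `Entry` and `Library` names, the field names and the example paths are hypothetical, not part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Entry:
    """Metadata for one item in the library; the file itself stays on disk."""
    path: str                                       # location relative to the library root
    kind: str                                       # e.g. "ebook", "website", "note"
    categories: set = field(default_factory=set)    # as many categories as needed
    author: str = ""                                # "me" for self-generated documents
    checked_in: date = field(default_factory=date.today)
    details: dict = field(default_factory=dict)     # publisher, year, edition, ...
    notes: list = field(default_factory=list)       # comments and cross-references


class Library:
    """A flat list of entries that can be queried from several angles."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)
        return entry

    def in_category(self, *names):
        """Entries filed under every one of the given categories."""
        wanted = set(names)
        return [e for e in self.entries if wanted <= e.categories]

    def of_kind(self, kind):
        return [e for e in self.entries if e.kind == kind]

    def by_author(self, author):
        return [e for e in self.entries if e.author == author]


# Hypothetical example data.
library = Library()
library.add(Entry("notes/java_agents.txt", "note",
                  {"Notes", "Java", "WebDev"}, author="me"))
library.add(Entry("books/vb5_db_21days_2ed", "ebook",
                  {"VB", "Databases"}, details={"edition": 2}))

java_in_notes = library.in_category("Notes", "Java")
ebooks = library.of_kind("ebook")
mine = library.by_author("me")
```

The point of the sketch is that a document no longer lives in exactly one branch of a tree: the same entry turns up under Notes, Java and WebDev, and lists of ebooks or of self-written documents are just different queries over the same records.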

Does this venture firmly into the realm of document management systems? It's not an area I've worked with before, so perhaps these problems have already been solved. Perhaps someone will think "you dufus, Domino.doc does this!" It's worth investigation, I suppose.

Notes might provide a good way to store the metadata for this system because the ideas of multiple categorisation, full-text searching and so on are all already there. There are a few caveats, though. The first is that I don't want to store the documents as attachments: I want them stored on a filesystem so that I can dig around in the "raw" stuff if I like. So Notes would have to synchronise with the actual file system data (then again, so would other systems). I want the system to be cross-platform so that I can manage the library from Windows and Linux. Finally (and perhaps contrarily), I would like the metadata to be easily browsable in a way that acts much like a file system browser, and it must be small and easy to use (or else I'll always just fall back to the file system, which would be pointless).
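Whatever ends up holding the metadata, the synchronisation step boils down to comparing what the index believes with what is actually on disk. A rough Python sketch of that comparison follows; the `compare_with_disk` function and the example file names are hypothetical, and it assumes the index stores paths relative to the library root with forward slashes.

```python
import tempfile
from pathlib import Path


def compare_with_disk(root, indexed_paths):
    """Return (files on disk but not indexed, indexed paths missing from disk)."""
    root = Path(root)
    # as_posix() keeps the relative paths identical on Windows and Linux.
    on_disk = {p.relative_to(root).as_posix()
               for p in root.rglob("*") if p.is_file()}
    indexed = set(indexed_paths)
    return sorted(on_disk - indexed), sorted(indexed - on_disk)


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "java").mkdir()
    (root / "java" / "agents.txt").write_text("notes")
    (root / "todo.txt").write_text("unfiled")

    unindexed, missing = compare_with_disk(
        root, ["java/agents.txt", "old/gone.pdf"])
```

Here `unindexed` lists files that still need metadata and `missing` lists index entries whose files have been moved or deleted behind the system's back, which is exactly what happens when digging around in the raw files.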

Another thing I want is to maintain a strong distinction between the 'archived' version of some reference info and a more accessible version. What I often do is zip up the original HTML version of a book. Then, when I need to use it, I unzip it into the data directory of a home web server. At some point, things get out of sync and it becomes a pain to clean up. A system that knows what's where and what's "checked out" would be great, so that if the fancy takes me I can keep the metadata and archived, raw files and simply regenerate the more publicly accessed bits.
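The zip-and-unzip routine described above could be tracked with something as small as the following Python sketch. It is only an illustration under stated assumptions: the `CheckoutRegistry` class, the JSON registry file and the directory names are all hypothetical, and the unzipped copy is treated as disposable because the archive remains the master version.

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path


class CheckoutRegistry:
    """Remembers which archives are currently unzipped, and where."""

    def __init__(self, registry_file):
        self.registry_file = Path(registry_file)
        if self.registry_file.exists():
            self.checked_out = json.loads(self.registry_file.read_text())
        else:
            self.checked_out = {}

    def _save(self):
        self.registry_file.write_text(json.dumps(self.checked_out, indent=2))

    def check_out(self, archive, target_dir):
        """Unzip the archive into target_dir and record the fact."""
        archive, target_dir = Path(archive), Path(target_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
        self.checked_out[str(archive)] = str(target_dir)
        self._save()

    def discard(self, archive):
        """Delete the unzipped copy; the archive itself is left untouched."""
        target = self.checked_out.pop(str(Path(archive)), None)
        if target is not None:
            shutil.rmtree(target, ignore_errors=True)
            self._save()


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "book.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("index.html", "<html></html>")

    registry = CheckoutRegistry(tmp / "checkouts.json")
    web_copy = tmp / "htdocs" / "book"

    registry.check_out(archive, web_copy)
    was_extracted = (web_copy / "index.html").exists()

    registry.discard(archive)
    was_removed = not web_copy.exists()
    archive_kept = archive.exists()
```

Because the registry knows every checked-out location, cleaning up a web server directory that has drifted out of sync becomes a matter of discarding the copies and checking the archives out again.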

Well, this is what I want, and there's nothing stopping me from fiddling with a simple little system to manage all of this. It's quite possible that I'll lose interest soon enough, but if not, it might be useful to others as well. That's assuming the wheel hasn't been invented already, of course. If you have any thoughts or ideas about how a problem like this should be solved, feel free to chime in.

{2004.08.16 15:35}
