I’m Afraid of URIs
Kim Cameron writes about namespace changes relating to Microsoft’s Cardspace initiative. The explanations offered sound good, but it’s hard not to be somewhat annoyed if you’re the one patching your code as a result of this change. This also reminds me of a few unconnected experiences that revolve, at least somewhat, around the permanence of URIs.
URIs used to denote namespaces often (typically?) don’t actually resolve as URLs. They name a transfer protocol, but they’re not meant to be used with that protocol (e.g. they don’t link to documentation about that namespace). It seems to me that this doubles the burden on a mechanism that isn’t necessarily appropriate for either job. I suppose the argument goes that you control your domain, so you can split that resource among its various responsibilities. Sounds shaky to me, but let’s see where it leads us.
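To make the namespace point concrete, here is a minimal sketch using Python’s standard xml.etree.ElementTree. The two namespace URIs are hypothetical stand-ins, not the actual Cardspace ones. The parser never fetches the URI; it only compares it as a string, so a consumer written against the old string silently stops matching once the publisher changes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this illustration.
OLD_NS = "http://example.com/2005/identity"
NEW_NS = "http://example.com/2007/identity"

# A document published after the namespace change.
doc = f'<id:card xmlns:id="{NEW_NS}"><id:name>Alice</id:name></id:card>'
root = ET.fromstring(doc)

# The URI is never dereferenced; it is simply folded into each tag name.
print(root.tag)  # {http://example.com/2007/identity}card


def find_name(element, namespace):
    """Look up the <name> child under the given namespace string."""
    return element.find(f"{{{namespace}}}name")


# Code still using the old string finds nothing, even though the data is there.
print(find_name(root, OLD_NS))
# Code patched to the new string finds the element again.
print(find_name(root, NEW_NS).text)
```

Nothing about the data changed between the two lookups; only the identifier string did, which is exactly the patching chore described above.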
I recently wrote about software widgets in today’s web climate. In that article, and the follow-up comment dialogue with Bruno Pedro, I basically stated that I worry about systems that rely on URLs just as I worry about screen scraping. Screen scraping is great because it sort of side-steps the issue of agreeing on a format (you make the content, you define the format), but it stinks because if you decide to re-format your content, my code breaks… even if the data stays the same! So let’s say that APP (the Atom Publishing Protocol) solves the format problem, and see where that leads us.
An occurrence that seems fairly common in the world of blogs is that of switching between blog services. I’m reminded of Stephen O’Grady’s migration diary which covers much more than blogs. Stephen put in the effort, but many blogs have died only to be reborn elsewhere with the old content relegated to who knows what fate. If the personal web page of the 1990s has become the blog (or Live Space, or Myspace, etc.), and if each person’s data is to be made available through their web presence, then the tying of URL to blog to person to information is fairly important. In this scenario, a URL change is disastrous.
So when I put it all together, I’m using my domain name to identify namespaces that are potentially distinct from the content served up via HTTP from that domain. I’m also using my domain name to locate information that isn’t intrinsically related to my domain. I think there’s a blog in there, too. Personally, I’m going to closely watch Google Base to see if it catches on. I could host my own data but have a unique Google Base identifier for it that I can edit to reflect changes in where I’m keeping my data. So how about rather than using a URI to identify my namespace, I identify it as this, which is a unique identifier, can be annotated with relevant metadata (like a link to documentation), and won’t screw anyone else up if I change the URL of my website.
Do I think XML namespaces should be defined as dependent on Google Base? No, but it’s a start. What I’d like to see is a system where I could look for information at an address like base.google.com/$hash, which would fall back on an address consisting solely of $hash if the information wasn’t found at Google Base (along with support for permanent redirects). The hash lookup would be a distributed database, or, at the very least, resolved by someone other than Google, Yahoo, Microsoft, Amazon, etc. While using Google Base can be viewed as safe today with supporting logic like “It’s not like Google’s going to disappear tomorrow!”, the same could be said of Microsoft, which is, after all, just how this post got started.
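A rough sketch of that lookup order, in Python. Everything here is an assumption made for illustration: the two dictionaries stand in for Google Base and for the distributed hash database, the function names are invented, and a “permanent redirect” is modeled as one identifier pointing at another.

```python
from typing import Callable, Dict, Optional, Sequence

# A resolver maps an identifier to a location, or returns None if it has no entry.
Resolver = Callable[[str], Optional[str]]


def resolve(key: str,
            resolvers: Sequence[Resolver],
            redirects: Dict[str, str]) -> Optional[str]:
    """Follow permanent redirects, then try each resolver in order."""
    seen = set()
    while key in redirects:
        if key in seen:
            raise ValueError(f"redirect loop at {key!r}")
        seen.add(key)
        key = redirects[key]
    for lookup in resolvers:
        location = lookup(key)
        if location is not None:
            return location
    return None


# Hypothetical stand-ins: a hosted index first, a distributed table second.
hosted_index = {"abc123": "http://example.com/my-data"}
distributed_table = {"def456": "http://example.org/other-data"}
permanent_redirects = {"old999": "def456"}

chain = [hosted_index.get, distributed_table.get]

print(resolve("abc123", chain, permanent_redirects))  # found in the hosted index
print(resolve("old999", chain, permanent_redirects))  # redirected, then fallback
print(resolve("nope", chain, permanent_redirects))    # not found anywhere
```

The point of the ordering is that the big host is only a fast first stop; if it has no entry, or disappears altogether, the same identifier still resolves through the fallback.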
Have you seen http://purl.org/ ?
Joe,
I am clearly in line with some of what PURLs provide, but I don’t think they go far enough. I don’t want to be dependent on any one name resolution system, and I do want something concrete backing the name system. The advantage of using hashes as the ultimate “true” address of a document is that it is probably going to be unique (modulo the uniqueness of the data), and it alleviates the need to even bother with coming up with unique names for abstract data sources. Of course, names are useful for humans, so I support the idea of attaching putative names to pieces of data. The key is that the natural language name is not the identity of the document. Instead, the natural language name I give my data is only going to be one more piece of metadata that a search engine can use to find my data. If I do know the actual address (the hash code), then I could just use that. This is similar to the idea of name resolution on top of IP addresses, but lifts the restriction that names be unique or tied to a particular file system location at a particular network location. This also circumvents the problem of coming up with a structured enumeration system for all the world’s information. Instead, the numbers are assigned at random, and search engines do their best to give you the data you’re looking for if you don’t already know that random number.
I appreciate that PURLs are built on the existing infrastructure, and I actually am, not so secretly, a big believer in the power of the current system. If I’m going to contemplate using another name resolution system, I think I’d like it to go further than just being a potentially more stable version of what we have today.
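Here is a minimal sketch of what I mean by the hash being the “true” address while names are just searchable metadata. The ContentStore class and its methods are hypothetical, and SHA-256 is simply my pick of hash function for the example.

```python
import hashlib
from collections import defaultdict


class ContentStore:
    """Toy in-memory store: data is addressed by hash, names are only hints."""

    def __init__(self):
        self._blobs = {}                # address -> data
        self._names = defaultdict(set)  # name -> set of addresses

    def put(self, data: bytes, names=()) -> str:
        # The address is derived from the data itself, so nobody has to
        # invent a unique name or own a domain to publish something.
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        for name in names:
            self._names[name.lower()].add(address)
        return address

    def get(self, address: str):
        """Direct lookup when the address is already known."""
        return self._blobs.get(address)

    def search(self, name: str):
        """Names need not be unique, so a search may return several addresses."""
        return sorted(self._names.get(name.lower(), set()))


store = ContentStore()
first = store.put(b"my namespace documentation", names=["identity schema"])
second = store.put(b"someone else's schema", names=["identity schema"])

print(store.get(first))
print(store.search("Identity Schema"))
```

Two people can use the same name without colliding, and renaming or moving a document never changes its address.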
Anthony,
I like your idea of using Google Base to save information about document location, but why not use some other, more suitable service, like del.icio.us?
Bruno,
That’s a thought I hadn’t considered. My impression is that del.icio.us is very much document-centric, which makes me a bit skeptical about using it in a more general way. I also have to admit that I haven’t tried using their API, so I’m not clear on how user authentication would play into a generic data addressing scheme.
Really, though, the main reasons I suggested Google Base are that they will happily host your data for you (if that’s what you’d like), the metadata fields are extensible, and Google has a great infrastructure. My rationale for involving Google Base at all was to take advantage of the structure of the Internet: Google’s got a ton of bandwidth and storage, so let’s use it. Really, I’m only addressing latency, one of the problems listed in these interesting slides on how a P2P DNS is inferior to today’s DNS. The problem of insertion attacks, however, seems like a real core issue rather than just vandalism. If I want to address massively more data, then I need to face that problem head-on and not rely on an approval process that is unlikely to scale to the necessary extent.