
The Google Base DNS

September 22nd, 2006

There are interesting ramifications that come from decoupling data from its location, and we enjoy them every time we surf the web. An early step in this direction was the binding of a natural language name to an IP address. This step has carried us far: it makes the network far more accessible to its users (humans), and more robust, since those bindings can be changed, which insulates the natural language labels from the nitty-gritty of connecting millions of computers to each other. The ability for that binding to change makes a lot of sense: IP assignment no longer needs to tie content down in any way (modulo DNS update propagation delays). However, the freedom gained by separating natural language name from IP address primarily benefits the people running services, i.e. domain name owners. My domain name may be significant to me (maybe it’s my company’s name), but my IP address has no meaning outside of packet routing. This system works great if every piece of content can be tied to a specific domain name, but what happens when every node on the network (human and machine) starts producing a heterogeneous, far-ranging panoply of content? Do we all need our own domain names for every piece of information we generate? Does a 1-to-1 binding even make sense? The answers, respectively: apparent chaos, no, and no.

That said, the proliferation of data on the Internet has, thus far, been largely tamed by search engines. I remember a time when I’d take multiple guesses at a URL because that was how one found information on the web. Sometimes this worked (“I wonder what the URL for Toyota’s website is?”), sometimes it didn’t (make up your own joke). But now I don’t blindly guess at URLs. I just type what I want into the search box and it, in a way, removes a lot of the redundancy from natural language (e.g. many names for the same thing) by exploiting the complementary redundancy in my query, other people’s queries, and the way people link to things (e.g. many different link texts pointing at the same page). So search engines are actually doing a better job at getting my packets routed to the right IP than the domain name system? I think so!

There are neither neighborhoods nor neighbors on the Internet; your address is irrelevant.

What matters is discoverability and identification. Search engines currently perform both of those tasks based largely on contextual information, so why not just skip DNS altogether and get straight to what search engines feed on: metadata? Perhaps instead of a link directly to my site, I could give you a saved search that will find my site. Well, let’s not go that far; I don’t want to be a complete slave to the vagaries of PageRank. So I want my information discoverable via the metadata I annotate it with, and I want it uniquely identifiable so that I, and all my friends, can shortcut the search engine to get at my information. I also want direct identification so that others can annotate my information with their own metadata (i.e. the text surrounding links that uniquely identify my information). What I don’t want is for any of that to have any tie to the physical layer. None of that discovery or identification has anything to do with file systems or network routing.
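The two requirements above (discovery through metadata, and a stable identifier that is independent of where the content lives) can be sketched in a few lines of Python. This is only an illustration: the `Registry` class, its method names, the item id, and the example URLs are all hypothetical, and the class merely stands in for a hosted service such as Google Base.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One published item: a stable id, a current location, free-form metadata."""
    item_id: str
    location: str
    metadata: dict = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory registry standing in for a hosted metadata service."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        self._entries[entry.item_id] = entry

    def resolve(self, item_id):
        """Identification: map a stable id straight to the current location."""
        return self._entries[item_id].location

    def search(self, **wanted):
        """Discovery: ids of entries whose metadata matches every given field."""
        return [
            e.item_id
            for e in self._entries.values()
            if all(e.metadata.get(k) == v for k, v in wanted.items())
        ]

    def move(self, item_id, new_location):
        """Rebind the id to a new host; the id and the metadata stay the same."""
        self._entries[item_id].location = new_location


registry = Registry()
registry.publish(Entry(
    item_id="item-001",  # made-up identifier
    location="http://old-host.example/tutorial.html",
    metadata={"Author": "Anthony", "Topic": "Ruby"},
))

# The content changes hosts; nobody holding the id or the metadata has to care.
registry.move("item-001", "http://new-host.example/tutorial.html")
found = registry.search(Topic="Ruby")
where = registry.resolve("item-001")
```

Neither `search` nor `resolve` ever exposes the old location to the caller, which is the point: the location is an implementation detail that only the registry entry needs to track.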

So here’s the admittedly silly example: I made a Google Base entry for a short Ruby example/tutorial I recently published, and I wanted to pull an excerpt from it. Now that I have that entry, it doesn’t matter where I host my tutorial: if I switch domain names, I can just update my GBase entry, and, as long as that’s where you look for the tutorial, you will find it at the new location. So now I can pull an excerpt from the introduction to the tutorial using a GBase-based redirection system. Due to the cross-domain limitations of XMLHttpRequest, I need to proxy the data from GBase on my own server, but there is no reason Google couldn’t provide the same service.

If you look at the page source for the Crazy Quote Machine excerpt, you can see that I didn’t just copy and paste the text from the original web page. No, that would be a very Web 1.0 way of quoting a page. Instead, I embed the live content in the new page. Now anyone can host the crazy_quote.html file, and it will always have the first paragraph from my short tutorial, wherever I host that tutorial.

To see how it works, take a look at the output of my GBase proxy: the information from the GBase entry is now formatted directly as JavaScript. The gbase_proxy will also, optionally, include the full text found at included URLs (when it gets the data from GBase it can inspect the GBase metadata to discern what’s a URL and what isn’t), which avoids monkeying around with the aforementioned cross-domain limitations of XMLHttpRequest. The code is trivial, consisting largely of regular expressions that extract data from HTML, but the end result is that using the new URL for the content lets me write things like gbaseItem.Author in my web page to get the metadata field “Author” associated with the information I’m using, and frees me from worrying about updating links should the host change.
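The proxy's source is not reproduced in this post, so what follows is only a rough Python sketch of the idea, not the actual gbase_proxy. The function names `first_paragraph` and `to_javascript`, the `fetch` callback, the `_text` field suffix, and the example URL are all assumptions made for illustration. One deliberate simplification: URLs are detected with a regular expression on the field value, rather than by inspecting the GBase metadata as described above.

```python
import json
import re


def first_paragraph(html):
    """Return the text of the first <p> element, with inner tags stripped."""
    match = re.search(r"<p\b[^>]*>(.*?)</p>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return ""
    text = re.sub(r"<[^>]+>", "", match.group(1))  # drop nested markup
    return " ".join(text.split())  # collapse runs of whitespace


def to_javascript(item, fetch=None, var_name="gbaseItem"):
    """Render an entry's metadata as a single JavaScript assignment.

    `item` maps field names to values. If `fetch` is given, it is called on
    every string value that looks like a URL, and whatever it returns is
    inlined under "<field>_text" (a made-up naming convention).
    """
    out = dict(item)
    if fetch is not None:
        for name, value in item.items():
            if isinstance(value, str) and re.match(r"https?://", value):
                out[name + "_text"] = fetch(value)
    # JSON text is used here as the JavaScript object literal.
    return "var %s = %s;" % (var_name, json.dumps(out, sort_keys=True))


# Stand-in for real HTTP fetching, so the sketch runs without a network.
pages = {
    "http://host.example/tutorial.html":
        "<html><body><h1>Title</h1><p>Hello, <b>world</b>.</p><p>More.</p></body></html>",
}
item = {"Author": "Anthony", "Link": "http://host.example/tutorial.html"}

script = to_javascript(item, fetch=pages.get)
excerpt = first_paragraph(pages[item["Link"]])
```

A page that loads `script` through a script tag served from the proxy's own domain could then read `gbaseItem.Author`, or pull the excerpt out of `gbaseItem.Link_text`, without making a cross-domain XMLHttpRequest itself.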

This system comes with more than its fair share of holes: it needs DNS to find Google, and it requires that data authors maintain their Google Base entries. Some tentative solutions to these problems have been suggested and talked about in earlier posts, but, mainly, I hope this working example is at least somewhat entertaining to the interested reader.

Anthony Programming, Software Design
