Message-Id: <9410061520.AA06487@austin2.hal.com>
To: Peter Deutsch <peterd@bunyip.com>
Subject: Why URNs are a subset of URIs [Was: No "TOP" of the docuverse]
In-Reply-To: Your message of "Wed, 05 Oct 1994 20:41:05 EDT."
<9410060041.AA09204@expresso.bunyip.com>
Date: Thu, 06 Oct 1994 10:19:59 -0500
From: "Daniel W. Connolly" <connolly@hal.com>
In message <9410060041.AA09204@expresso.bunyip.com>, Peter Deutsch writes:
>[ Daniel W. Connolly wrote: ]
>. . .
>> Bingo. I don't know why folks go to such trouble to distinguish URLs
>> and URNs.
>
>Perhaps you don't understand what we want them for... :-)
>
>> . . . The URL concept grew out of the WWW addressing architecture
>> which was designed to include locationally transparent addresses[1].
>> The fact that such addresses have not yet been deployed is not a
>> design decision but a reflection of the fact that it takes time to
>> deploy technology.
>
>And the URN concept grew out of the need of services such
>as ours (archie and its follow-ons) to identify multiple
>instantiations of information independent of its location.
This feature -- call it "location independent names" or "identifying
multiple instantiations" -- is called "high availability" or
"replication" in the distributed computing literature that I have
read.
I do hope that URNs and URLs don't have to be incompatible just
because one came from WWW and the other came from archie. They're both
wonderfully useful applications, and the sooner they interoperate, the
better.
>When I get lots of archie hits I cannot simply compare
>their URLs for equality to see if they're the same
>document (because a URL identifies a resource's location
>but not its content), nor can I be sure that I found all
>copies of a document (because someone is free to rename a
>document and its URL would change).
And this feature is called "authentication." You want parties
to be able to certify that two references reference the "same"
thing, and you want other parties to be able to check that
information.
>Put another way, The requirements for URNs are intended to
>allow us to perform an entirely different set of
>operations on the named objects. In computer science
>terms, you compare URNs and dereference URLs. Thus, I
>submit that the difference between the two classes is real.
In practice, sometimes you compare URLs, and sometimes you dereference
them. Caching is already somewhat widely deployed for http: URLs. It
certainly involves comparing URLs. There are time-to-live issues, but
there are solutions to those issues, both heuristic and guaranteed.
And I bet that sometimes folks will compare URNs, and sometimes they
will dereference them using big databases.
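To make the "caches compare URLs" point concrete, here is a minimal sketch of a cache keyed on the URL string with a heuristic time-to-live. The class name UrlCache, its methods, and the injectable clock are assumptions for illustration, not how any particular http cache is built.

```python
import time


class UrlCache:
    """Toy cache keyed by URL string; entries expire after ttl seconds."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example can be tested
        self.entries = {}       # url -> (body, time stored)

    def put(self, url, body):
        self.entries[url] = (body, self.clock())

    def get(self, url):
        """Return the cached body, or None if absent or older than ttl."""
        hit = self.entries.get(url)     # the lookup is a URL comparison
        if hit is None:
            return None
        body, stored = hit
        if self.clock() - stored > self.ttl:
            del self.entries[url]       # heuristic expiry
            return None
        return body
```

A miss falls through to dereferencing the URL in the usual way, so the same identifier is both compared and dereferenced.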
>> The idea that URNs are somehow fundamentally different from URLs is
>> odd, and the proposals of deploying a namespace disjoint with the WWW
>> address syntax is just plain silly.
>I respectfully disagree with the above paragraph. The WWW
>address space is just that, an address space, along with
>accompanying protocol (and where appropriate, host)
>information. A URL gives you the information you need to
>access a copy of a resource.
Well... except that you have to go through DNS to find out the "real"
location of the resource. Have you read the cited notes by TimBL about
the distinction between names and addresses, and how they blur? I find
it quite convincing.
> It does _not_ allow me to
>perform the operation I need to perform, which is to compare
>multiple instantiations of resources for equality of
>content without examining the content itself. On the other
>hand, a suitable URN _will_ allow me to perform that
>operation. Ergo, URLs and URNs are not the same thing.
There are no URI schemes that allow this _yet_. So let's get busy and
deploy them! That doesn't mean we have to reinvent the syntax.
>(BTW, I certainly don't require URNs to have high
>availability nor authentication. I merely require that
>they identify content, not location.)
We must be using different definitions for the same terms -- otherwise
the above is pure doublespeak. Let's take for a working definition of
high availability:
    A resource is _highly available_ if there is no single
    point of failure between the producer and the consumer
    of the resource. An optimal high availability strategy
    will also result in consumers accessing the "nearest"
    replica of a resource most of the time.
And for authentication:
    A data entity E is an _authentic_ representation of
    a name N at time t iff the owner of N has certified
    that it is so.
For example, for resources that consist of a sequence of bytes, you
can use md5://... . The "owner" of all md5:// names is the special md5
principal, and E is an authentic representation of md5://sum iff
md5(E) = sum, independent of time.
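The md5:// check above can be sketched in a few lines. This assumes the sum is carried in the name as lowercase or uppercase hex, which the message does not specify; the function name is_authentic_md5 is likewise made up for illustration.

```python
import hashlib


def is_authentic_md5(name, entity):
    """True iff entity (bytes) hashes to the sum carried in an md5://sum name.

    Assumption: the sum is written as a hex string after the "md5://" prefix.
    """
    prefix = "md5://"
    if not name.startswith(prefix):
        raise ValueError("not an md5 name: " + name)
    claimed = name[len(prefix):].lower()
    return hashlib.md5(entity).hexdigest() == claimed
```

No certificate and no clock are involved: anyone holding the bytes can run the check, which is why it holds independent of time.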
For resources that change (like weathermaps), we need to deploy
identifiers like urn://principal/name, where to check that E is an
authentic representation of urn://principal/name at t, we need a
signed certificate that says so, for example
    (C, S)
    where C = (principal, name, cksum, t0, t1)
and S is the RSA signature of C with principal's key. Once we have
obtained principal's key and verified C w.r.t. S, we check that
md5(E) = cksum and that t0 <= t <= t1. (Although time is a slippery
thing in distributed systems... we need to think about it some more).
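The (C, S) check can be sketched as follows. Everything beyond the tuple layout is an assumption: the key is a textbook toy RSA key (p=61, q=53) with no padding, the certificate is serialized with repr(), and the digest is reduced modulo the tiny modulus, so this shows the order of the checks and nothing about a secure encoding.

```python
import hashlib

# Toy RSA key (p=61, q=53): far too small for real use, illustration only.
N, E_PUB, D_PRIV = 3233, 17, 2753


def digest(cert):
    """Reduce the certificate tuple C to an integer below N (toy hashing)."""
    data = repr(cert).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest(), "big") % N


def sign(cert):
    """S = digest(C)^d mod n, computed by the principal with its private key."""
    return pow(digest(cert), D_PRIV, N)


def is_authentic(entity, name, t, cert, sig):
    """Check entity against urn://principal/name at time t using (C, S)."""
    principal, cert_name, cksum, t0, t1 = cert
    if pow(sig, E_PUB, N) != digest(cert):          # verify C w.r.t. S
        return False
    if name != "urn://%s/%s" % (principal, cert_name):
        return False
    if hashlib.md5(entity).hexdigest() != cksum:    # md5(E) = cksum
        return False
    return t0 <= t <= t1                            # validity window
```

Here t is whatever the checker believes the current time to be, which is exactly where the slipperiness comes in.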
>With that as background, let's consider a couple of
>scenarios.
>
>In the archie context, we plan to serve to our users both
>a location pointer and a content identifier at the same
>time. Thus, a search for the string fred might return:
>
> URN:12345 URL:ftp://site.com/pub/fred/
> URN:45666 URL:gopher://site.com/usr/fred/
> URN:12345 URL:ftp://bozo.com/pub/fred/
> URN:59555 URL:ftp://mysite.edu/pub/fred/
>
>This allows me to see that the first and third entries are
>the same item, so I don't need to examine both
First we'll notice that modulo capitalization, the above URNs
are perfectly good WWW addresses (URIs, if you will). Writing
URL: in front of the gopher: and ftp: addresses is redundant.
Second, what sort of reliability/fault detection mechanisms accompany
the deployment of these URNs? That is, what's to stop rogue servers
from saying X is a copy of Y when it's really not?
If we replace URN: above with md5:, then the consumer can
independently check the authenticity of the ultimate resource, and
rogue servers are detected.
I think this is an essential feature of the system. Not that strong
authentication must be used in every case -- I agree that there are
applications that don't require it -- but the overall system must
allow for authentication if it is to encompass valuable information.
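A rough sketch of the md5: variant of the archie listing: hits are (name, location) pairs, equal names group the copies, and the consumer can catch a server that claims a name its bytes do not match. The function names and the caller-supplied fetch(url) are hypothetical, and the sum is assumed to be hex.

```python
import hashlib
from collections import defaultdict


def group_by_name(hits):
    """Map each content name to the list of locations claiming to hold it."""
    groups = defaultdict(list)
    for name, url in hits:
        groups[name].append(url)
    return dict(groups)


def rogue_locations(hits, fetch):
    """Return locations whose bytes do not match their md5: name.

    fetch(url) -> bytes is supplied by the caller (hypothetical here).
    """
    bad = []
    for name, url in hits:
        claimed = name[len("md5:"):]
        if hashlib.md5(fetch(url)).hexdigest() != claimed:
            bad.append(url)
    return bad
```

Grouping needs no network access at all; the verification pass costs one retrieval per location and can be skipped by applications that don't need it.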
I do agree that a _pair_ of URIs is a good data structure for a
reliable link. In fact, I first saw this idea in the WAIS documentation.
A WAIS docid was a tuple:
    (original-server, original-db, original-local-id,
     distributor-server, distributor-db, distributor-local-id)
When you distributed copies of stuff, you kept the original address
as part of the link.
I'd sure like to see that sort of thing deployed ASAP in
databases of mail and news. I'd like to be able to write:
<a href="http://gummo.xxx.yyy/www-talk/126.html"
urn="mid:203408kjljs@hal.com">
so that if some caching server in between me and gummo (like the
caching http server on my desk) knows that there's a copy of that mail
message somewhere closer, it can serve me that one instead. There's
no fault detection, but it's a start.
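What such a caching server might do with the href/urn pair, as a minimal sketch: look the name up locally first, and only follow the address on a miss. The function resolve, the local_copies dict, and the fetch_remote callable are all assumptions for illustration.

```python
def resolve(href, urn, local_copies, fetch_remote):
    """Prefer a nearby copy known by name; otherwise follow the address.

    local_copies: dict mapping a name such as "mid:..." to cached bytes.
    fetch_remote: callable taking the href and returning bytes (hypothetical).
    """
    if urn is not None and urn in local_copies:
        return local_copies[urn]
    body = fetch_remote(href)
    if urn is not None:
        local_copies[urn] = body    # remember it under its name for next time
    return body
```

As with the link itself, nothing here checks that the nearby copy really is the same message.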
>Alternatively, I might want to do a search for a
>particular URN, say number 12345
Ah! So now we're dereferencing URNs! I thought so...
>Conceptually and practically there are still two different
>classes of identifier being used and of course getting to
>this ideal state will still require working with the
>installed base of URLs. There is a difference here and
>even if you don't need both, some of us most definitely
>do...
I agree we need more WWW addressing schemes (URI schemes, if you
like). I don't agree that URNs should be incompatible with URLs.
Dan