Message-Id: <9404300210.AA13921@expresso.bunyip.com>
From: Peter Deutsch <peterd@bunyip.com>
Date: Fri, 29 Apr 1994 22:10:17 -0400
In-Reply-To: Alexander Dupuy's message as of Apr 26, 12:35
To: dupuy@smarts.com (Alexander Dupuy), uri@bunyip.com
Subject: Re: MD5 and LIFNs (was: Misc Comments)
[ Alexander Dupuy wrote: ]
. . .
> I am in violent agreement with you that MD5-based URNs can be useful. My
> criticism of MD5 was in response to the original suggestion that a *single*
> "MD5" namespace authority could be defined to provide a URN namespace. I
> believe you have not disputed the points I made, but rather suggested that MD5
> is a useful tool for generating URNs, and this is certainly true. To be
> concrete about it, the original message was suggesting that anyone could
> generate URNs which looked something like:
>
> [1] URN:MD5:<hexadecimal md5 digest here>
>
> while I believe that you are suggesting that a coordinated archie-like URN
> provider could use md5 to generate URNs for electronic documents available via
> FTP; these would look something like:
>
> [2] URN:bunyip.urn.int:md5:<hexadecimal md5 digest here>
>
> or better (in view of the possibility that some files may share md5 digests)
>
> [3] URN:bunyip.urn.int:md5:<hexadecimal md5 digest here>:<size>:<serial>
> where the document size is used to differentiate between files that have
> matching md5 digests, and a serial number is appended to deal with the remote
> possibility that two documents may have the same size and md5 digest. Note
> that this serial number implies that some entity is actively indexing the URNs
> it issues and can determine how many URNs with the same md5 digest and size
> have already been issued.
I agree that size would be a nice addition to help a bit
with potential collisions, but given the way I see these
being calculated, I don't think serial numbers are
practical in this case. In my mind, we have to ask the ftp
servers to
calculate the checksums (otherwise, we'd have to copy the
file over to do it ourselves) and since we obviously can't
compare the files that produce collisions from various
sites for equality, how would we be able to assign
differing serial numbers? I think we should accept that
this technique in this case is a second-best alternative
and not sweat it too much.
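
As a rough illustration of the digest-plus-size form without a
serial number, the calculation an ftp server would be asked to do
might look like the sketch below. The function name md5_urn and
the default authority string are assumptions made for this
example (the authority is borrowed from form [3] in the quoted
text); this is not a proposed syntax.

```python
import hashlib


def md5_urn(data: bytes, authority: str = "bunyip.urn.int") -> str:
    """Build a URN of the form URN:<authority>:md5:<digest>:<size>.

    Hypothetical helper: it mirrors form [3] from the quoted message,
    minus the serial number, which cannot be assigned without
    comparing colliding files across sites.
    """
    digest = hashlib.md5(data).hexdigest()  # 32 lowercase hex characters
    size = len(data)                        # file size in bytes
    return f"URN:{authority}:md5:{digest}:{size}"


if __name__ == "__main__":
    print(md5_urn(b"abc"))
```

Two byte-identical files on different sites yield the same
string, and two different files that happen to share both digest
and size would too, which is the accepted limitation here.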
> The difference between [1] and [2,3] is that I don't think it is feasible in
> the long term for a *single* URN namespace authority to manage a
> non-hierarchical namespace which attempts to provide a URN for *every* digital
> document in the world.
I don't think we're looking at using this for "every"
document in the world. There are obviously better
techniques, where we can apply them. MD5 is an example of
a second choice where the first can't be applied (for
example, because the server is not the publisher and the
original publisher is not available to take responsibility
for assignment).
I think we have to accept that a) we won't have universal
coverage for any particular URN scheme and b) we won't be
100 percent accurate or reliable in all schemes, either.
In this context, MD5 checksums work for anonFTP and gopher
indexes and such and can be checked against archie to
dereference anonFTP or gopher files to a corresponding
URL (in fact, in this case archie _is_ the URN->URL
conversion service). That's still better than what we have
today and will be (I think) generally useful, so whatever
we come up with here had better allow for it.
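
To make the URN->URL step concrete, here is a minimal sketch of
an index keyed on (digest, size) that maps a URN back to every
URL where matching content was seen. The Index class, its method
names, and the example hosts are hypothetical and exist only for
illustration; nothing here describes how archie is actually
implemented.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (hex md5 digest, size in bytes)


class Index:
    """Toy URN->URL lookup table (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._urls: Dict[Key, List[str]] = defaultdict(list)

    def add(self, url: str, data: bytes) -> None:
        # Record that this URL holds content with this digest and size.
        key = (hashlib.md5(data).hexdigest(), len(data))
        self._urls[key].append(url)

    def resolve(self, urn: str) -> List[str]:
        # Expected shape: URN:<authority>:md5:<digest>:<size>
        parts = urn.split(":")
        if len(parts) != 5 or parts[0].upper() != "URN" or parts[2].lower() != "md5":
            raise ValueError("not an md5-style URN: %r" % urn)
        key = (parts[3].lower(), int(parts[4]))
        # Unknown URNs resolve to an empty list rather than an error.
        return list(self._urls.get(key, []))


if __name__ == "__main__":
    idx = Index()
    idx.add("ftp://a.example/pub/x", b"abc")
    idx.add("ftp://b.example/mirror/x", b"abc")
    print(idx.resolve("URN:bunyip.urn.int:md5:900150983cd24fb0d6963f7d28e17f72:3"))
```

Note that the participating sites do nothing here beyond being
readable; all of the bookkeeping sits with the indexer.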
If we want better functionality or reliability, we should
use another naming authority that uses another scheme.
That's why we need to keep the spec open and flexible, so
people can choose the degree of reliability and
functionality they need.
> . . . I do think it is feasible for any URN namespace
> authority which wants to to use md5 to generate URNs to do so (although they
> should be aware that md5 digests are not necessarily unique).
Yup, see above.
> Part of the problem with [1] is that there is no reason to believe that a URN
> generated in that way is actually resolvable (i.e. that anyone can get a URL
> from it). It just assumes that there is some entity out there which just
> indexes anything and everything. While this may be true for files in
> well-known FTP archives, it is less likely to be true for Compuserve uploads,
But I'm not trying to solve every problem in one shot.
I'd be happy if we add these things to archie and they're
useful. I'd be happier if they're part of a general
solution, but I don't want to hold my breath that we get
this right on the first pass and it works for every
application in the same way. I just don't think that's
going to happen, nor do I think it the best solution for
the Internet as a whole.
>
> A more realistic scenario is that when I make something available via an FTP
> site, I might know that Bunyip will be indexing it. I can calculate the md5
> digest and size, check to see if Bunyip already has a URN with those values,
> and thereby determine what Bunyip's URN will be. If I were making it
> available via other means which aren't indexed by Bunyip, but are indexed by
> somebody else, then I would check the somebody else's URN servers instead.
The problem I have with this approach (where you calculate
the checksum, etc. and decide what someone else would be
doing) is that in this example it seems to me that you're
not actually the one tasked with assigning the URN and by
allowing each info server to take on that role you're
setting up a system where eventual conflicts are
inevitable. As the person assigning them in this example,
Bunyip must reserve the right to disagree with you. In
that case, why have you calculate it at all?
I have little faith in building a scalable, useable system
for this example which requires every participant to get
assignment of the URN right. The archie system seems to
work as well as it does essentially because there is so
little required from participants.
Basically, if you're available and we can read you we can
track you. If we go with MD5 checksums here, we might
require your site to install a new "ls", but after that,
we can ask for the calculations as we need them and your
you are back out of the loop. If we require you to be
performing calculations and even assigning serial numbers
for us, I just don't see everyone getting it right, and
continuing to get it right over time.
Now, in problem spaces where a few people are involved
(e.g. traditional publishers assigning ISBN equivalents,
etc) perhaps there's more hope. This is what indicates to
me the need for alternate naming authorities and higher
quality assignment schemes. We need to be flexible here,
depending upon the operational requirements and available
resources.
- peterd
--
-----------------------------------------------------------------------------
"What do thay got, a whole lot of sand? We got a hot crustacean band!
Each little clam here, know how to jam here! Under the Sea!"
-----------------------------------------------------------------------------