Date: Sat, 30 Apr 1994 12:33:07 +0500
From: dupuy@smarts.com (Alexander Dupuy)
Message-Id: <9404301633.AA16144@brainy.smarts.com>
To: uri@bunyip.com, peterd@bunyip.com
Subject: Re: MD5 and LIFNs (was: Misc Comments)
> I agree that size would be a nice addition to help a bit
> with potential collisions, but the way I see calculating
> these I don't think serial numbers are practical in this
> case. In my mind, we have to ask the ftp servers to
> calculate the checksums (otherwise, we'd have to copy the
> file over to do it ourselves) and since we obviously can't
> compare the files that produce collisions from various
> sites for equality, how would we be able to assign
> differing serial numbers? I think we should accept that
> this technique in this case is a second-best alternative
> and not sweat it too much.
I think there are probably some heuristics that could be used to distinguish
between the same file appearing in different places and different files having
the same size and MD5 digest. One obvious one is to compare the filenames.
Two filenames which don't have any (case-insensitive) common six-character
substring are probably not variant names of the same file, but are rather
different names for different files. If the names don't match, you can
retrieve the files and compare them to make sure they really are different.
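As a rough illustration of the filename heuristic, here is a minimal Python
sketch. The function name and the choice to compare bare filenames (not full
paths) are assumptions made for this example, not part of any agreed scheme.

```python
def share_common_substring(name_a: str, name_b: str, length: int = 6) -> bool:
    """Return True if the two names share a case-insensitive substring
    of the given length (six characters by default)."""
    a = name_a.lower()
    b = name_b.lower()
    # A name shorter than `length` cannot contain a substring of that length.
    if len(a) < length or len(b) < length:
        return False
    # Collect every window of `length` characters from the first name...
    windows = {a[i:i + length] for i in range(len(a) - length + 1)}
    # ...and check whether any window of the second name is among them.
    return any(b[j:j + length] in windows for j in range(len(b) - length + 1))
```

Under this sketch, "emacs-19.22.tar.gz" and "EMACS-19.22.tar.Z" match, while
"readme.txt" and "gcc.tar" do not. Names shorter than six characters never
match, so they would always fall through to the retrieve-and-compare step.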
> The problem I have with this approach (where you calculate
> the checksum, etc., and decide what someone else would be
> doing) is that in this example it seems to me that you're
> not actually the one tasked with assigning the URN and by
> allowing each info server to take on that role you're
> setting up a system where eventual conflicts are
> inevitable. As the person assigning them in this example,
> Bunyip must reserve the right to disagree with you. In
> that case, why have you calculate it at all?
You don't need to calculate it. You can always wait for the next archie pass
to discover the file. But sometimes, it might be useful to know in advance
what the URN will be (so you can put the URN in the notice you post to netnews
or the web announcing the availability of this new file). But this is totally
optional, and as you say, Bunyip reserves the right to disagree with you.
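For anyone who does want an advance estimate, the two inputs discussed in this
thread (file size and MD5 digest) can be computed locally. The sketch below
only produces those two values; how they would be combined into a URN, and any
serial number, is deliberately left out, since that is Bunyip's call. The
function name and chunk size are assumptions for illustration.

```python
import hashlib
import os


def size_and_md5(path, chunk_size=65536):
    """Return (size in bytes, MD5 hex digest) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Whatever this yields is still only an estimate of the eventual identifier.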
> Basically, if you're available and we can read you we can
> track you. If we go with MD5 checksums here, we might
> require your site to install a new "ls", but after that,
> we can ask for the calculations as we need them and you
> are back out of the loop. If we require you to be
> performing calculations and even assigning serial numbers
> for us, I just don't see everyone getting it right, and
> continuing to get it right over time.
Agreed. Bunyip would be the only party required or authorized to assign
serial numbers. However, it would be useful if they make public the algorithm
used to generate serial numbers, so that others can make fairly accurate
estimates (and really, that's all they are, estimates) of what the eventual
URN for a document retrievable via FTP/gopher would be.
@alex