Re: URL revision

Daniel W. Connolly (connolly@hal.com)
Thu, 21 Jul 1994 10:45:24 -0500

Message-Id: <9407211545.AA12129@ulua.hal.com>
To: Larry Masinter <masinter@parc.xerox.com>
Subject: Re: URL revision
In-Reply-To: Your message of "Thu, 21 Jul 1994 00:19:42 PDT."
<94Jul21.001943pdt.2760@golden.parc.xerox.com>
Date: Thu, 21 Jul 1994 10:45:24 -0500
From: "Daniel W. Connolly" <connolly@hal.com>

In message <94Jul21.001943pdt.2760@golden.parc.xerox.com>, Larry Masinter writes:
>Re: Hello? Did I miss it again? WHAT IS THE ARGUMENT IN FAVOR OF THIS?
> I recall about 5-7 'nays' on this issue and not a single 'yea'.
>
>This is the result of the survey for question 12:

I thought we weren't voting :-)

Doesn't it strike anybody else as harmful that URL: matches the syntax
of scheme:, so that you have to special-case it? E.g.:

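# search for URLs in plain text...
while (<>) {
    # match scheme:rest; /i so that the uppercase "URL" prefix matches too
    if (/\b([a-z.-]+):(\S+)/i) {
        my ($scheme, $path) = ($1, $2);

        # special case: "URL:" looks like a scheme, so strip it and re-parse
        if ($scheme eq "URL") {
            ($scheme, $path) = $path =~ /([a-z.-]+):(.*)/i;
        }
        ...
    }
}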

If we're dead set on using this thing, can we make it URL! or URL/ or
URL=?
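
To make the point concrete, here is a minimal sketch (the sub name
find_url is made up for illustration, and URL= is just one of the
candidate prefixes above). Since '=' is outside the scheme character
class, the same pattern finds the real scheme with no special case:

```perl
use strict;
use warnings;

# Hypothetical helper: return (scheme, rest) for the first URL-like
# token in a line of text, or an empty list if there is none.
sub find_url {
    my ($line) = @_;
    # '=' is not in [a-z.-], so "URL=" can never be taken for a scheme
    return $line =~ /\b([a-z.-]+):(\S+)/i ? ($1, $2) : ();
}

my ($scheme, $rest) = find_url("see URL=ftp://ftp.isi.edu/in-notes/ for details");
print "$scheme $rest\n";    # scheme is "ftp", not "URL"
```

With the URL: form, the same call would hand back "URL" as the scheme,
which is exactly the case the loop above has to work around.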

>The primary purpose of the wrapper characters is to delimit the end of
>the URL, and to add an additional confirmation that the letters 'URL:'
>actually precede a URL.
>
>I don't think any set of characters can do more than that, and
>especially not 'distinguish URLs from other things': that is the role
>of the URL: prefix.

Gotcha.

Well... if we're set on a URL prefix, let's get some use out of it. If
we want to be able to pick URLs out of plaintext and other data
formats reliably, we could adopt an optional checksum syntax:

    URL(LL.SS)=scheme:...
where LL is the length in chars of the URL, mod 0x100,
and SS is the sum of the bytes in the URL, mod 0x100,
each written as two hex digits.

Some examples:
URL(2D.C2)=ftp://parcftp.parc.xerox.com/pub/ilu/ilu.html
URL(2B.FC)=http://martigny.ai.mit.edu/scheme-home.html
URL(34.8F)=http://www.cis.ohio-state.edu/htbin/rfc/rfc1468.html
URL(27.27)=ftp://ftp.isi.edu/in-notes/media-types/
URL(35.D0)=http://etext.virginia.edu/bin/tei-tocs?div=DIV1&id=GR
URL(23.30)=http://www.mecklerweb.com/demo.html
URL(1F.35)=ftp://ftp.netcom.com/pub/pearl/
URL(2C.0F)=http://www.cs.indiana.edu/elisp/w3/docs.html
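
For anybody who wants to check the arithmetic, here is a rough sketch
of how such a tag could be generated. The sub name url_tag is
hypothetical, and it assumes the URL is plain ASCII so that characters
and bytes are the same thing:

```perl
use strict;
use warnings;

# Hypothetical helper: build the "URL(LL.SS)=" form of a URL.
# LL = length of the URL mod 0x100, SS = sum of its bytes mod 0x100,
# each printed as two uppercase hex digits.
sub url_tag {
    my ($url) = @_;
    my $sum = 0;
    $sum += $_ for unpack("C*", $url);    # add up the byte values
    return sprintf("URL(%02X.%02X)=%s",
                   length($url) % 0x100, $sum % 0x100, $url);
}

print url_tag("ftp://ftp.netcom.com/pub/pearl/"), "\n";
# prints: URL(1F.35)=ftp://ftp.netcom.com/pub/pearl/
```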

The length allows you to know how many characters to expect, so you
can reconstruct URLs that got broken across lines. The checksum allows
you to be sure (with pretty high probability) that somebody put the URL
there on purpose, and it's not just random text that happens to match.

This would allow you to, for example, put a list of URLs in
postscript comments, or in a compiled binary, or any random data
format, and "URLgrep" them out reliably, much like the ident command
does for RCS keywords.
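
A "URLgrep" along these lines might look something like the sketch
below. Nothing here is a spec: the sub name urlgrep is made up, and it
assumes URLs are under 256 bytes and that line breaking only inserted
whitespace (no comment leaders or other decoration on continuation
lines). It uses the length to rejoin a broken URL and the checksum to
throw out accidental matches:

```perl
use strict;
use warnings;

# Hypothetical "URLgrep": pull URL(LL.SS)=... entries out of arbitrary
# data, rejoining URLs broken across lines and dropping bad checksums.
sub urlgrep {
    my ($data) = @_;
    my @found;
    while ($data =~ /URL\(([0-9A-F]{2})\.([0-9A-F]{2})\)=/g) {
        my ($len, $want) = (hex $1, hex $2);
        my $start = pos($data);
        my $url = "";
        # collect exactly $len non-whitespace bytes, skipping line breaks
        while (length($url) < $len && $data =~ /\G\s*(\S)/gc) {
            $url .= $1;
        }
        my $sum = 0;
        $sum += $_ for unpack("C*", $url);
        push @found, $url
            if $len > 0 && length($url) == $len && $sum % 0x100 == $want;
        pos($data) = $start;    # resume scanning right after the tag
    }
    return @found;
}

my $text = "%% see URL(1F.35)=ftp://ftp.netcom.com\n/pub/pearl/ and URL(04.00)=junk\n";
print "$_\n" for urlgrep($text);
# prints: ftp://ftp.netcom.com/pub/pearl/
```

The second entry is dropped because its bytes don't add up to the
stated checksum.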

> Please note that as it is written "#" is *not* reserved,
>but rather, it is *universally unsafe*.
> ...
> (This follows from the decision that individual schemes may
>differ on what characters are reserved, even though they may not
>differ on what characters are unsafe.)

I am thoroughly confused now... I'll have to read it again a couple
times.

I thought the point was that everybody agrees on what's reserved, and
that the transport defines what's unsafe; e.g. in a reliable 8-bit
transport, safety is not an issue, but reserved characters are.

Conversely, in an unreliable transport like mail, you must leave the
reserved characters as-is, but you can escape/quote as many of the
other characters as you need to in order to get the URL through
intact. (What do you do with URLs over 72 characters long? Bracket
them with <> and split them across lines?)

Dan