Date: Thu, 15 Sep 94 15:06:16 -0400
From: Tim Berners-Lee <timbl@quag.lcs.mit.edu>
Message-Id: <9409151906.AA00511@quag.lcs.mit.edu>
To: Chris Weider <clw@mocha.bunyip.com>
Subject: Re: <URL:...> considered harmful
> From: Chris Weider <clw@mocha.bunyip.com>
> Since I am the one who proposed the wrapper in the first place, let me
> state why I think we *still* need something like this, and suggest some
> possibilities for a modified wrapper now that we threw out the URL:
> prefix at the last IETF.
Leaving the URL wrapper thrown out is a good thing.
The URL spec is too important to wait for our woolly deliberations
over the plain text wrapper. So irrespective of the spec, let
us deliberate.
> We still need a way to distinguish a URL in plain text. Using a
> scheme-based recognition technique, which looks for a valid scheme and
> then extracts the rest of the line (or the rest of the line up to the
> next white space) has several problems. They are:
> Scheme recognition. The number of new schemes will constantly increase.
> Thus, without a generic wrapper, sites which have not installed the
> latest set of schemes into their extraction tool will not be able to
> correctly identify valid URLs embedded into text. A *human* might be
> able to, if they are familiar with all the schemes, but there will still
> be many that are missed by an automated scheme.
> (I'm disregarding here the actual resolution of the URL).
Agreed. Although in fact <[a-z0-9.]*:[a-zA-Z0-9/_.+etc]*>
will work fine without the "URL:".
We have to be wary of looking for something which will
work 100% of the time, as we can *never* have that, because
we can *never* exclude any syntax from cropping up
elsewhere in plain text. There is always the possibility
of ambiguity, so we will always technically have a heuristic,
even though the <> convention means it works in all but
pathological cases.
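
As a rough illustration of the point above, here is a minimal Python
sketch of scheme-independent extraction. The exact character classes
are an assumption (the pattern quoted above trails off in "etc"), and
the name find_bracketed and the scheme "newscheme" are hypothetical.
It shows both that no table of known schemes is needed and that the
match remains a heuristic.

```python
import re

# Assumed pattern: a scheme-like token, a colon, then any run of
# non-space characters up to the closing ">". No list of known
# schemes is consulted.
BRACKETED = re.compile(r"<([a-z][a-z0-9+.-]*):([^<>\s]+)>", re.IGNORECASE)

def find_bracketed(text):
    """Return (scheme, rest) pairs for every <scheme:rest> in text."""
    return BRACKETED.findall(text)

# A known scheme and a made-up one are both picked up.
print(find_bracketed("See <http://info.cern.ch/> or <newscheme:some/path>."))

# A pathological case: ordinary text with the same shape also matches.
print(find_bracketed("In C++ one writes <std::vector> a lot."))
```

The second call returns a match even though no URL is present, which
is the sense in which the convention stays a heuristic.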
> Line length. The proposals I've seen for the X.500 URL will require far
Agreed -- a point for <>.
> Human recognition. What's my current algorithm? Look for colons and
> then scan the surrounding text hoping to recognize some URL format?
> I think that we can be substantially more friendly than that.
Human recognition is not really a problem. The amazing brains we
have use context, from which it is clear that something is a
reference, and a lot more besides. Humans also recognize mail
addresses as such in practice, so I feel the *human* recognition
of the existence of a URL is not a problem. There is a delimiter
problem, which Dan rightly points out, especially with trailing
punctuation -- which is why the <> are useful.
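
The delimiter problem can be seen in a few lines of Python. This is
only a sketch: the functions bare_guess and bracketed and the
example.org address are hypothetical, and the "bare" rule (take any
whitespace-delimited token containing "://") is an assumed stand-in
for whatever a real extraction tool does.

```python
import re

def bare_guess(text):
    """Naive rule: any whitespace-delimited token containing '://'."""
    return [tok for tok in text.split() if "://" in tok]

def bracketed(text):
    """Take only what sits between '<' and '>'."""
    return re.findall(r"<([^<>\s]+)>", text)

line_bare = "The draft is at ftp://example.org/drafts/url.txt."
line_wrapped = "The draft is at <ftp://example.org/drafts/url.txt>."

# The sentence-ending full stop is swallowed into the bare guess...
print(bare_guess(line_bare))
# ...but falls outside the brackets in the wrapped form.
print(bracketed(line_wrapped))
```

A tool could strip trailing punctuation from the bare form, but a
URL may legitimately end in such a character, so that too is a guess.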
> So, having said that, let me propose a solution. I freely admit that my
> suggested wrapper doesn't fit into the 'sgml'ish flavor of HTML.
> So. Two suggestions that may fit better...
>
> 1: Highly recommending the anchor syntax (with surrounding <A> and </A>)
> for all URLs quoted in free text. This allows the immediate display of
> any text based document (with the appropriate semantics) through Mosaic.
I have a serious problem with this, in that you *can't*
use SGML syntax in plain text. Yes, I know Mosaic does it, but it
is very sloppy practice. Text is plain or it is SGML.
If it's SGML you have to escape all "<" and "&" in the text
to something else. As a markup language for slightly enriched
but human-readable text, SGML stinks.
If you want to use SGML, just pipe the thing through
text2html.sed and make it real text/html, and then the
software will be able to handle it in a well defined way.
If we want to use plain text, though, we should keep it
easy to write and easy to read without extra tools.
A halfway house will be untenable and we will be cursing
it henceforth.
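
For illustration, the escaping step looks roughly like this in
Python. This is not the text2html.sed script mentioned above, whose
contents are not reproduced here; plain_to_html is a hypothetical
helper, and wrapping the result in <pre> is an assumption about how
one might keep the original line breaks.

```python
import html

def plain_to_html(text):
    """Escape markup-significant characters and wrap the result in <pre>."""
    # quote=False leaves quotation marks alone; "&", "<" and ">" are escaped.
    escaped = html.escape(text, quote=False)
    return "<pre>\n" + escaped + "\n</pre>"

print(plain_to_html("Write <ftp://example.org/x> & see if a < b."))
```

Once every "<" and "&" has been escaped, the text is no longer
pleasant to read or write by hand, which is the objection to a
halfway house between plain text and SGML.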
> 2: The development of a new tag, call it URI, for example,
> <uri ref="http:blah/blah/blah"> and highly recommending its use. This is
> perhaps less general, but is a fairly useful hack in my opinion, and
> allows all types of references to be placed inside.
Again, the objection to being SGMLish but simpler.
Believe me, I tried this route a long time ago. :-}
> In either case, I hope that I've convinced you of the necessity of a
> wrapper. Tools are already being developed to take (for example) e-mail
> and extract the URLs: if we can make their job easier, I think that will
> be a major win.
Yup -- but I want wysiwyg HTML.
> Yes, it does mean that we have to make some changes now, but I
> believe that this will save us a lot of trouble in several years.
Let us *not* change the spec. But let us adopt the convention
in practice. It is only a convention, and we can never enforce
what people put in plain text.
> Chris Weider
Tim