Date: Wed, 10 Aug 1994 00:51:05 -0700
From: jak@violet.berkeley.edu (John A. Kunze)
Message-Id: <199408100751.AAA23331@violet.berkeley.edu>
To: uri@bunyip.com
Subject: Re: Draft URL document, for last call to be proposed standard RFC
Comments are interspersed in text below. The main one is that I think
we'd be making a mistake by not requiring hyphens to be encoded.
Details appear later on.
> From: Larry Masinter <masinter@parc.xerox.com>
> Date: Thu, 4 Aug 1994 15:46:18 PDT
>
> Uniform Resource Locators T. Berners-Lee
> draft-ietf-uri-url-06.txt L. Masinter
> ...
> 0. Abstract
>
> This document specifies a Uniform Resource Locator (URL), the
> syntax and semantics of formalized information for location and
> access of resources on the Internet.
^^^^^^^^^^^^^^^
The qualifier "on the Internet" is untrue. The mailto: and file:
schemes describe locations of resources that are not necessarily
available on the Internet.
The file: scheme just identifies a file on a host, independent of any
networking context. The mailto: scheme describes a mail "depository"
(e.g., a mail robot, or an individual's mailbox) on some host to which
e-mail may be sent. That e-mail probably never touches the Internet if
both sending and receiving hosts are on the same LAN. Even if the sender
is on the Internet, the receiver need not be, thanks to e-mail gateways.
> 2. Recommendations
>
> This document describes the syntax for "Uniform Resource Locators"
> (URLs): a compact representation of the location and access method
> for a resource available on the Internet. Just as there are many
^^^^^^^^^^^^^^^
Same comment as above.
> 2.1. URL SYNTAX
>
> URLs are written as follows:
>
> <scheme>:<scheme-specific-part>
>
> A the URL contains the name of the scheme being used (<scheme>)
^^^
Delete "the".
> 2.2. Reserved, unsafe, and encoded characters
> ...
> There are a number of characters whose use in URLs is _unsafe_;
> characters can be unsafe for a number of reasons. The characters
> "<" and ">" are unsafe because they are used as the delimiters
> around URLs in free text; the quote mark (""") is used to delimit
> URLs in other systems. The character "#" is unsafe because it is
^^^^^
"Other" than what? I'd change it to "some".
> used in World Wide Web and in other systems to delimit a URL from a
> fragment identifier that might follow it. Other characters are
> unsafe because gateways and other transport agents are sometimes
> known to modify such characters. All unsafe characters should
^^^
This big paragraph needs air. Split out new paragraph starting with "All".
> always be encoded within a URL. For example, the character "#"
> should always be encoded within URLs, even in systems that do not
> normally deal with fragment identifiers, so that if the URL is
> copied into another system that does use fragments it will not be
> necessary to change the URL encoding.
>
> In general, only alphanumerics, reserved characters used for their
> reserved purposes, "$", "-", "_", ".", and "+" are safe and may be
^^^
We know from the Toronto meeting that human typesetters and editors
of newspapers are already introducing hyphens into the middle of URLs.
Because this invalidates the URL, I'd say that the hyphen is not only
not safe, but actually one of the more risky characters. I'd like
to see us at least make a statement about this risk.
Ideally I'd like to see us promote the practice of (a) not producing
URLs with unencoded hyphens and (b) stripping out unencoded hyphens
(along with unencoded whitespace) as if they weren't there at all.
The main problem is that in current practice folks use lots of hyphens
(e.g., draft-ietf-uri-url-06.txt!). But since we don't let the installed
base sway us, now we can do it right.
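To make practice (b) concrete, here is a minimal Python sketch; it is
not anything from the draft. The function names are hypothetical, and
the sketch assumes that an intentional hyphen is always written as the
escape %2D, so that any literal hyphen or whitespace can be treated as
typesetting noise and dropped.

```python
import re

# Assumption: an intended hyphen is always written as the escape %2D (or %2d).
_ENCODED_HYPHEN = re.compile(r"%2[dD]")


def strip_unencoded_hyphens(url: str) -> str:
    """Drop literal hyphens and whitespace; %2D escapes are left untouched."""
    return re.sub(r"[-\s]", "", url)


def decode_hyphens(url: str) -> str:
    """Turn %2D escapes back into hyphens, e.g. for display after cleanup."""
    return _ENCODED_HYPHEN.sub("-", url)


# A URL as a typesetter might break it across lines:
typeset = "ftp://ds.inter-\nnic.net/rfc"
print(strip_unencoded_hyphens(typeset))
```

Under that assumption a name such as draft%2Dietf%2Duri%2Durl%2D06.txt
survives the cleanup intact, while a hyphen introduced at a line break
disappears along with the newline.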
> transmitted unencoded. Even so, safe characters _may_ be encoded
> within the scheme specific part of a URL.
> ...
>
> 3.1. Common Internet Scheme Syntax
> ...
> host
> The fully qualified domain name of a network host, or its IP
> address as a set of four decimal digits separated by periods.
> Fully qualified domain names take the form as described in
^^
Unneeded word ("as").
> Section 3.5 of RFC 1034: a sequence of parts separated by
> period.
>
> port
> The (optional) port number to connect to. Most schemes
> designate protocols that have a default port number. Another
> port number may optionally be supplied, in decimal, separated
> from the host by a colon.
For the <port>, the draft needs to mention whether the colon is optional.
Also, it would be helpful to say what the semantics of having two ports are.
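The open questions show up as soon as one tries to parse the host and
port. The following Python sketch is one possible reading, not what the
draft specifies: the function name and the default-port table are
illustrative, a trailing colon with no port is assumed to mean "use the
default", and a second port is assumed to be an error.

```python
import re
from typing import Tuple

# Illustrative table of well-known default ports; not taken from the draft.
DEFAULT_PORTS = {"ftp": 21, "http": 80, "gopher": 70}


def split_hostport(hostport: str, scheme: str) -> Tuple[str, int]:
    """Split "host[:port]" into (host, port), falling back to the default."""
    host, _colon, port = hostport.partition(":")
    if ":" in port:
        # Assumption: a second port has no defined meaning, so reject it.
        raise ValueError("more than one port in %r" % hostport)
    if port == "":
        # Assumption: "host" and "host:" both mean the scheme's default port.
        return host, DEFAULT_PORTS[scheme]
    if not re.fullmatch(r"[0-9]+", port):
        raise ValueError("port is not decimal in %r" % hostport)
    return host, int(port)
```

Each branch marked "Assumption" is a place where the text of the draft
leaves the behavior open.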
> url-path
> The rest of the locator consists of data specific to the
> scheme, and is known as the "url-path". It supplies the
> details of how the specified resource can be accessed. Note
> that the "/" between the host (or port) and the url-path is
> NOT part of the url-path.
>
> The url-path is interpreted in a manner dependent on the scheme
> being used.
This last sentence seems redundant. Remove?
> 3.5. MAILTO
>
> The mailto URL scheme is used to designate the Internet mailing
> address of an individual or service. No additional information
> other than an Internet mailing address is present or implied.
What's the resource being located? I tried on the concept of "depository"
(above), but maybe there's a better fit.
> APPENDIX: Recommendations for URLs in Context
> ...
> In some cases, extra whitespace may need to be added to break long
It would help to identify whitespace as spaces, tabs, newlines, carriage
returns, form feeds, etc.
> URLs across lines. The whitespace is ignored when extracting the
> URL. In the case where a fragment identifier is associated with a
> URL (following a "#"), the identifier would be placed within the
> brackets as well.
>
> Examples
>
> Yes, Jim, I found it under <URL:ftp://info.cern.ch/pub/www/doc
> ;type=d> but you can probably pick it up from <URL:ftp://ds.inter
> nic.net/rfc>. Note the warning in <URL:http://ds.internic.net/
> instructions/overview.html#WARNING>.
This example(s?) could use formatting as an indented/quoted paragraph
with an introductory sentence tacked in front. Right now the "Yes, Jim"
is jarring enough that the reader wastes time trying to figure out if
the text was included in error.
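For what it is worth, the extraction rule the appendix describes is
short to write down. This is a minimal Python sketch with a
hypothetical function name; it assumes whitespace means whatever
Python's str.split() treats as whitespace (spaces, tabs, newlines,
carriage returns, form feeds), and that ">" never appears unencoded
inside a URL, since the draft lists it as unsafe.

```python
import re
from typing import List

# Non-greedy match up to the first ">", which is unsafe and so never
# appears unencoded inside the URL itself.
_BRACKETED = re.compile(r"<URL:(.*?)>", re.DOTALL)


def extract_urls(text: str) -> List[str]:
    """Return the URLs found in <URL:...> brackets, with whitespace removed."""
    return ["".join(body.split()) for body in _BRACKETED.findall(text)]


sample = (
    "Yes, Jim, I found it under <URL:ftp://info.cern.ch/pub/www/doc\n"
    ";type=d> but you can probably pick it up from <URL:ftp://ds.inter\n"
    "nic.net/rfc>. Note the warning in <URL:http://ds.internic.net/\n"
    "instructions/overview.html#WARNING>."
)
for url in extract_urls(sample):
    print(url)
```

Note that the fragment identifier after "#" comes out as part of the
third URL, matching the draft's statement that it is placed within the
brackets.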
-John