Re: Wrappers for URLs

Rickard Schoultz (schoultz@othello.admin.kth.se)
Tue, 11 May 93 10:36:58 +0200

Date: Tue, 11 May 93 10:36:58 +0200
Message-Id: <9305110836.AA06718@mercutio.admin.kth.se>
From: Rickard Schoultz <schoultz@othello.admin.kth.se>
To: uri@bunyip.com
In-Reply-To: <9305080145.AA05811@wilma.cs.utk.edu>
Subject: Re: Wrappers for URLs

Keith Moore <moore@cs.utk.edu> writes in <9305080145.AA05811@wilma.cs.utk.edu>:

> A URL may be represented in either of two formats. Exchange format is the
> format recommended for use in communications protocols between programs that
> use URLs. Print format allows URLs to be represented in media that have
> limitations on line length. Print format is recommended for representing
> URLs in printed form, and also in ordinary text files.

Since "print format" will also be a kind of "exchange format"
(between computer and human user), we would like to suggest that
the two formats be called _program form_ (for program-to-program
communication) and _human form_ (for program-to-human and
human-to-program communication).

> 1. A URL in "exchange format" is written entirely in printable, non-space
> ASCII characters with octets from the range from 21 to 7E hex, inclusive.
>
> (a) Any octet may be represented as '%' followed by two upper case hex
> digits.
>
> (b) Octets in the "safe" set { 21-24 hex, and 26-3B hex, 3D hex, and 3F-7E
> hex } may be encoded as the corresponding ASCII characters.
>
> (c) Octets outside the "safe" set MUST be represented in hex according to
> rule (a).

Good principles. We have only some questions about details:

"#" can't be used freely in URLs, since it starts an "anchorid".
Maybe 23 hex should then be excluded from the "safe" set?
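
A minimal Python sketch of the quoted rules 1(a)-(c), for illustration
only. The function name to_exchange and the exclude_hash switch are
hypothetical; the switch shows what removing 23 hex from the "safe" set
would mean in practice.

```python
# "Safe" octets as listed in the quoted rule 1(b):
# 21-24 hex, 26-3B hex, 3D hex and 3F-7E hex.
SAFE = (set(range(0x21, 0x25)) | set(range(0x26, 0x3C))
        | {0x3D} | set(range(0x3F, 0x7F)))


def to_exchange(octets: bytes, exclude_hash: bool = False) -> str:
    """Encode octets into exchange format (hypothetical helper).

    Safe octets are written as ASCII characters (rule b); all others
    become '%' plus two upper-case hex digits (rules a and c).
    With exclude_hash=True, "#" (23 hex) is treated as unsafe.
    """
    safe = SAFE - {0x23} if exclude_hash else SAFE
    return "".join(chr(b) if b in safe else f"%{b:02X}" for b in octets)


print(to_exchange(b"read me"))                 # read%20me
print(to_exchange(b"x#y"))                     # x#y
print(to_exchange(b"x#y", exclude_hash=True))  # x%23y
```

Note that "%" (25 hex), "<" (3C hex) and ">" (3E hex) fall outside the
quoted safe set, so they are always written in hex.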

Various other characters have a specific syntactic role in
different kinds of URLs, e.g.: "?" in "generic", "httpaddress",
"gopheraddress"; "/" in all kinds except "newsaddress" and
"telnetaddress". Should we say that in such cases %-encoding
can be used to specify as a "data character" what would
otherwise be a syntactically significant character? That would
make it possible, for example, to give a URL for FTP access to a
file "a:b" in directory pub on host othello.admin.kth.se as

ftp://othello.admin.kth.se/pub/a%3Ab

(The present syntax for "path" doesn't allow the ":". Another
peculiarity is that "=" isn't allowed in "path". Why is that?)
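
The "data character" idea can be sketched in Python as follows. This is
an illustration under assumptions, not part of any URL specification:
encode_segment, decode_path and the RESERVED set are hypothetical, and
names are assumed to be plain ASCII. The important point is the order
on the receiving side: split on the literal "/" first, then undo the
%-encoding in each piece.

```python
import re

# Characters assumed (for this sketch only) to be syntactically
# significant in a path; "%" must be escaped too so decoding is exact.
RESERVED = "/:?#%"


def encode_segment(name: str) -> str:
    """%-encode reserved characters so they are read as data."""
    return "".join(f"%{ord(c):02X}" if c in RESERVED else c for c in name)


def decode_path(path: str) -> list:
    """Split on literal '/' first, then decode each segment."""
    def unescape(segment: str) -> str:
        return re.sub(r"%([0-9A-Fa-f]{2})",
                      lambda m: chr(int(m.group(1), 16)), segment)
    return [unescape(s) for s in path.split("/")]


url = "ftp://othello.admin.kth.se/pub/" + encode_segment("a:b")
print(url)                       # ftp://othello.admin.kth.se/pub/a%3Ab
print(decode_path("pub/a%3Ab"))  # ['pub', 'a:b']
print(decode_path("pub/x%2Fy"))  # ['pub', 'x/y']
```

The sketch writes hex digits in upper case, as the quoted rule 1(a)
asks.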

> 3. An exchange format URL may be converted to "print format" by enclosing the
> URL with '<' and '>'. White space characters and line-breaks may appear
> in a print format URL, but these are entirely non-significant. To convert
> a print format URL to exchange format, remove the enclosing '<' and '>'
> characters and delete any internal white-space and line-breaks.

We would like to suggest a more user-friendly set of rules for
spaces and line-breaks in the human form of URLs:

A) White space may replace "%20", if it is not preceded or
followed by white space.

Thus the _very_ rare occurrence of two consecutive spaces
must be shown in the human form with at least one of them
represented by "%20".

B) Line-breaks may be used in two cases:

1) After a %-representation ("%" followed by two hexadecimal
digits) or after "/": In this case the line-break is
insignificant.

2) Instead of white space that does not follow immediately
after "/" or a %-representation: In this case the
line-break indicates one space character (represented by
"%20" in the program form).

By these rules we make it possible to use spaces, instead of the
ugly "%20", to indicate single spaces in the URL form intended
for human input and output to human readers. Long URLs can be
folded at the most natural points for human users: where there
is a space or at the syntactically most important boundaries.
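
Rules A and B can be sketched as a conversion from human form to
program form. This is a minimal Python illustration under assumptions:
the function name human_to_program and the host host.example are
hypothetical, continuation lines are assumed to carry no indentation,
and non-ASCII characters are not handled here.

```python
import re


def human_to_program(human: str) -> str:
    """Convert a human form URL to program form using rules A and B."""
    if not (human.startswith("<") and human.endswith(">")):
        raise ValueError("human form must be enclosed in '<' and '>'")
    body = human[1:-1]
    # Rule B1: a line-break right after a %-representation or "/"
    # is insignificant, so it is simply dropped.
    body = re.sub(r"(%[0-9A-Fa-f]{2}|/)\r?\n", r"\1", body)
    # Rule B2: any other line-break stands for one space.
    body = re.sub(r"\r?\n", " ", body)
    # Rule A: a literal space may not touch other white space.
    if "  " in body:
        raise ValueError("consecutive spaces: write one of them as %20")
    # Rule A: each remaining single space becomes %20.
    return body.replace(" ", "%20")


print(human_to_program("<ftp://host.example/pub/read me>"))
print(human_to_program("<ftp://host.example/pub/\nread me>"))
print(human_to_program("<ftp://host.example/pub/read\nme>"))
# all three print: ftp://host.example/pub/read%20me
```

The three inputs show an unfolded URL, a fold after "/" (rule B1) and a
fold in place of the space (rule B2); they describe the same file.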

As has already been pointed out in this discussion, non-ASCII
characters are used a lot in many countries outside the USA. In a
language using the Latin script, like Swedish, almost 1/3 of all
words contain non-ASCII letters. In languages such as Greek,
Russian, Hindi and Chinese, no ASCII letters at all are used.
For ordinary users in these countries it will be unacceptable to
see their everyday letters represented by %-headed hexadecimal
digit sequences.

We propose this solution:

C) In the human form of URLs non-ASCII characters may be used
provided a character set indicator is added to the URL
immediately before the closing ">". This indicator shall
have the syntax
"%:" charset
where "charset" is a value registered by IANA for MIME use.
The corresponding coded character set defines the mapping of
the non-ASCII character to a sequence of octets that can be
represented by the %-mechanism in the program form of the
URL.

Take as an example a file called

l<a">s mig

in the directory pub at host othello.admin.kth.se. Here <a">
is the character LATIN SMALL LETTER A WITH DIAERESIS, coded by
the octet E4 hex according to ISO-8859-1. (The name of the file
is Swedish for "read me".)

The human form URL for this file preferred in Sweden would be

<ftp://othello.admin.kth.se/pub/l<a">s mig%:iso-8859-1>

<a"> here would in reality be the non-ASCII character.

The program form would be

ftp://othello.admin.kth.se/pub/l%E4s%20mig
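
A Python sketch of how a client might apply rule C, given only as an
illustration. The function name is hypothetical, the URL is assumed to
sit on one line, and the charset label is assumed to be one that
Python's codecs recognise. The sketch writes the space as "%20" and
uses upper-case hex digits, following the quoted rules for the
exchange format.

```python
def human_to_program_with_charset(human: str) -> str:
    """Convert a human form URL carrying a "%:" charset indicator."""
    if not (human.startswith("<") and human.endswith(">")):
        raise ValueError("human form must be enclosed in '<' and '>'")
    body = human[1:-1]
    # The indicator, if present, is the last "%:" before the ">".
    head, sep, tail = body.rpartition("%:")
    if sep:
        body, charset = head, tail
    else:
        charset = "ascii"  # no indicator: non-ASCII input will fail
    out = []
    for ch in body:
        if ch == " ":
            out.append("%20")
        elif ord(ch) < 0x80:
            out.append(ch)
        else:
            # Map the character to octets in the declared charset,
            # then write each octet with the %-mechanism.
            out.extend(f"%{octet:02X}" for octet in ch.encode(charset))
    return "".join(out)


human = "<ftp://othello.admin.kth.se/pub/l\u00e4s mig%:iso-8859-1>"
print(human_to_program_with_charset(human))
# ftp://othello.admin.kth.se/pub/l%E4s%20mig
```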

Why is it necessary to include a character set indicator in
these extended URLs containing non-ASCII characters? Because
it makes the URL resistant to character-preserving conversion
between different coded character sets.

Say that the URL in the example is included in a text file coded
with ISO-8859-1, so the non-ASCII character is represented by
the octet E4. Then this file is transferred to a Mac and
therefore converted to the Macintosh character set. For the Mac
user it will look exactly as intended, containing a lowercase a
with diaeresis (which is necessary to form the Swedish
expression for "read me"). In the Mac file, however, this
letter will be represented by 8A instead of E4. Thanks to the
character set indicator %:iso-8859-1 a client program on the Mac
will however be able to feed the right octet E4 to the FTP
program to fetch the correct file from the FTP server (where
ISO-8859-1 is used in file names).
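
The Mac scenario can be traced in Python, which ships codecs for both
character sets. This is only a sketch: octets_for_server is a
hypothetical helper, and "mac_roman" is assumed to be the Macintosh
character set meant above.

```python
def octets_for_server(file_octets: bytes, file_charset: str,
                      url_charset: str) -> bytes:
    """Decode octets as stored locally, re-encode as the URL declares."""
    return file_octets.decode(file_charset).encode(url_charset)


name = "l\u00e4s mig"                  # the file name with a-diaeresis
original = name.encode("iso-8859-1")   # as written in the original file
on_mac = name.encode("mac_roman")      # after conversion for the Mac

print(original.hex())  # 6ce473206d6967  (E4 for the a-diaeresis)
print(on_mac.hex())    # 6c8a73206d6967  (8A for the same letter)

# The client on the Mac uses the indicator %:iso-8859-1 to recover
# the octets that the FTP server expects.
sent = octets_for_server(on_mac, "mac_roman", "iso-8859-1")
print(sent == original)  # True
```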

It could be argued that occasional non-ASCII letters are nothing
to make a fuss about: European users can be taught to read and
input %-sequences instead. But consider a Greek FTP server,
where almost the whole path of a file is written with Greek
letters using ISO-8859-7. In that case URLs will be almost
three times as long and consist mostly of a soup of "%" and
hexadecimal digits, interspersed with "/". Such URLs will be
unusable for humans, unless some way of using the real non-ASCII
letters is provided.
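
To make the "soup" concrete, here is a small Python sketch. The Greek
path is made up for the illustration and the helper name is
hypothetical.

```python
def percent_encode_non_ascii(text: str, charset: str) -> str:
    """Write every non-ASCII character as %-encoded octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode(charset))
        for ch in text
    )


# Hypothetical path: alpha beta gamma / delta epsilon
path = "/\u03b1\u03b2\u03b3/\u03b4\u03b5"
encoded = percent_encode_non_ascii(path, "iso-8859-7")

print(encoded)                   # /%E1%E2%E3/%E4%E5
print(len(path), len(encoded))   # 7 17
```

Each Greek letter grows from one character to three, while the "/"
separators stay as they are, hence "almost" three times as long.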

Another point in connection with internationalized URLs that
we would like to raise:

D) The hexadecimal %-headed representation used in the program
form is very inefficient. In countries with languages using
scripts other than Latin, URLs may be almost three times
as long as in English-speaking countries. To reduce this
unfairness we could, in addition to the %-representation,
include a &-representation: After "&" would follow a sequence
of octets encoded by a BASE64-like method into a 33 %
(instead of 200 %) longer sequence of the characters A-Z,
a-z, 0-9, + and -. This sequence would be ended by a
second "&".
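
A rough Python sketch of such a &-representation, built on the standard
base64 module. Several details are assumptions not settled by the
proposal: "-" takes the place of the "/" of ordinary BASE64, the "="
padding is dropped, and the names amp_encode and amp_decode are
hypothetical.

```python
import base64


def amp_encode(octets: bytes) -> str:
    """Wrap BASE64-like text (alphabet A-Z a-z 0-9 + -) in "&"..."&"."""
    text = base64.b64encode(octets, altchars=b"+-").decode("ascii")
    return "&" + text.rstrip("=") + "&"   # padding dropped (assumption)


def amp_decode(token: str) -> bytes:
    """Inverse of amp_encode; restores the padding before decoding."""
    if len(token) < 2 or token[0] != "&" or token[-1] != "&":
        raise ValueError("token must start and end with '&'")
    text = token[1:-1]
    text += "=" * (-len(text) % 4)
    return base64.b64decode(text, altchars=b"+-")


# Thirty Greek letters (alpha, beta, gamma repeated) in ISO-8859-7.
octets = ("\u03b1\u03b2\u03b3" * 10).encode("iso-8859-7")
percent_form = "".join(f"%{b:02X}" for b in octets)
amp_form = amp_encode(octets)

print(len(octets), len(percent_form), len(amp_form))  # 30 90 42
print(amp_decode(amp_form) == octets)                 # True
```

For 30 octets the %-representation needs 90 characters, while the
&-representation needs 40 plus the two "&" delimiters.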

--
Rickard Schoultz			schoultz@admin.kth.se
SUNET/KTH				+46-8-790 90 88   (voice)
S-100 44 Stockholm (SWEDEN)	    	+46-8-10 25 10    (fax)

Olle Jarnefors                       Internet: ojarnef@admin.kth.se
Information Management Services      UUCP: ...!uunet!mcsun!sunic!kth!ojarnef
Royal Institute of Technology (KTH)  BITNET: ojarnef@sekth  Fax:+46 8 10 25 10
S-100 44 Stockholm, Sweden           Phone: +46 8 790 71 26 (time zone +0200)

Peter Svanberg, NADA, KTH                Email: psv@nada.kth.se
Dept of Num An & CS, Royal Inst of Tech  Phone: +46 8 790 71 40
S-100 44 Stockholm, SWEDEN               Fax: +46 8 790 09 30