Date: Tue, 29 Mar 94 10:48:13 +0200
From: Tim Berners-Lee <timbl@ptpc00.cern.ch>
Message-Id: <9403290848.AA00745@ptpc00.cern.ch>
To: "Milan Sova" <sova@feld.cvut.cz>
Subject: Character set in URL
In "Uniform Resource Locators (URL)" - draft-ietf-uri-url-03 -
Tim wrote:
> .... Where the local naming
> scheme uses ASCII characters which are not allowed in the URL,
> these may be represented in the URL by a percent sign "%" followed
> by two hexadecimal digits (0-9, A-F) giving the ISO Latin 1 code
> for that character. ...
"Milan Sova" <sova@feld.cvut.cz> said,
> This excludes from using in URL any character not included in
> ISO 8859-1 charset. That means that eg. I won't be able to show my boss
> THE URL pointing to his record in our directory service (his name contains
> three ISO Latin 2 characters).
> I think the charset information should be included in the URL
> syntax ( hard-wiring any charset into a protocol must bring troubles
> sooner or later - eg. SMTP ).
Larry Masinter replied,
> I agree, I think we need to reword the document to make it clear that
> the %-escaped codes in URLs do not correspond to 'ISO Latin 1 code'
> but just the binary encoding.
So there are three possibilities we have now:
1. URL is defined to be binary, passed on to the underlying protocol.
2. URL spec is defined to be a specific 8-bit set, Latin-1.
3. URL spec is defined to be an 8-bit set defined with each URL.
Milan is right when he says that (2) is presumptuous and broken, but I think
to go for (3) is to open a can of worms. I think Larry's suggestion is
correct, but we have to be aware of the fact that if our Danish colleagues
read Milan's boss's URL, and their client software assumes Latin-1 encoding
when displaying it to them, then the result will be gibberish. But gibberish
which works. Which is what URLs are for. Not for looking at.
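
To make the "gibberish which works" point concrete, here is a minimal sketch
in Python. The name "Jiří" is a hypothetical stand-in (the actual name is not
given in this thread), and the choice of Python's urllib.parse is mine, not
anything the draft prescribes. It percent-escapes the Latin-2 bytes of the
name, then shows that a client assuming Latin-1 displays different text but
would still pass the same bytes on.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical name, not taken from the thread; it contains two non-ASCII
# letters that ISO 8859-2 (Latin-2) encodes as single bytes.
name = "Jiří"

# Percent-escape the Latin-2 bytes of the name.
escaped = quote(name.encode("iso-8859-2"))

# Undo the escaping: the result is plain bytes with no charset attached.
raw = unquote_to_bytes(escaped)

# Two clients guessing different charsets display different text
# ("Jiří" under Latin-2, "Jiøí" under Latin-1)...
as_latin2 = raw.decode("iso-8859-2")
as_latin1 = raw.decode("iso-8859-1")

# ...but both would hand the same escaped bytes on to the protocol.
resent = quote(raw)

print(escaped)             # Ji%F8%ED
print(resent == escaped)   # True
```

The displayed text depends entirely on the charset the client guesses, while
the escaped form, and hence what reaches the server, does not.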
I shall (modulo outcry) amend the spec to say that the character set for
anything outside ASCII is not defined, and it isn't even defined that ASCII is
the base set. A URL could just be straight kanji with no warning. Or binary.
The important thing is that whatever it is gets stuck into the protocol. Of course, most
old protocols are defined in terms of NET ASCII. Roll on Unicode for MIME,
and Plan9.
Tim