Message-Id: <9403251902.AA06109@ulua.hal.com>
To: hallam@alws.cern.ch, uri@bunyip.com
Subject: Formalisms for IIA [Was: LISP for Complex URC Sytax]
In-Reply-To: Your message of "Thu, 24 Mar 1994 20:58:58 +0100."
<9403241958.AA26232@dxmint.cern.ch>
Date: Fri, 25 Mar 1994 13:02:42 -0600
From: "Daniel W. Connolly" <connolly@hal.com>
In message <9403241958.AA26232@dxmint.cern.ch>, hallam@alws.cern.ch writes:
[Lots of good stuff...]
>
>Mitra:
>>So far, the players are:
>> SGML parsers
>> RFC822 header parsers
>> UR*-of-the-day parsers
Actually, I think I wrote that.
>I think we should aim rather on minimizing the complexity of the parser.
I agree. My experience says that if it can't be expressed in some
formalism (like lex and yacc) in a page or two, then it's too complex
and it won't be implemented consistently.
> Having
>a different syntax is not necessarily that much of a problem. Anyone writing a
>HTTP system in any case begins by writing an FSR compiler at the least.
FSR compiler... is this something like lex? I hope HTTP implementors
don't need one of these, though I agree it's useful in this context.
>If we want a simple system we use LISP. If we want a standard we should use
>ASN.1
How about using ASN to define the essential nature of the beast, and
specifying a LISP encoding so that it can be parsed easily?
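
To give a feel for "parsed easily", here is a minimal sketch of a reader
for a LISP-style encoding, in Python. The encoding itself (the `url`,
`scheme`, `host`, `port`, `path` and `search` field names, and the
rule that a run of digits is an integer) is an assumption made for
illustration; nothing here is a proposed IIA syntax.

```python
import re

# Tokens: a parenthesis, a double-quoted string, or a bare atom.
TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|[^\s()"]+)')

def tokenize(text):
    """Split an S-expression string into a flat list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"bad input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def read(tokens):
    """Consume one expression from the front of the token list."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(read(tokens))
        if not tokens:
            raise ValueError("missing ')'")
        tokens.pop(0)  # discard the closing parenthesis
        return items
    if token == ")":
        raise ValueError("unexpected ')'")
    if token.startswith('"'):
        return token[1:-1]  # strip the surrounding quotes
    return int(token) if token.isdigit() else token

def parse(text):
    """Parse exactly one S-expression; reject trailing input."""
    tokens = tokenize(text)
    expr = read(tokens)
    if tokens:
        raise ValueError("trailing input")
    return expr

# Hypothetical encoding of a URL-like object.
example = ('(url (scheme http) (host "host.com") (port 4000) '
           '(path dir1 dir2 file1) (search word1 word2))')
parsed = parse(example)
```

The whole reader is one regular expression and one recursive function,
which is roughly the "page or two" budget mentioned above.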
>
>Whatever we do we should not touch SGML with a bargepole. There is a practical
>benefit for using it as *A* (nb the indefinite article) format for text. Its
>a format we can expect most text processors to support.
>
My frustrations with SGML in my recent attempts to formalize the
internet information architecture are (1) that it really only deals
with characters, so typical protocol objects like octets and 32-bit
words are not expressible in SGML, and (2) that there's a mismatch
between the granularity that SGML lends itself to and the granularity
of URLs and the like. In order to express
http://host.com:4000/dir1/dir2/file1?word1+word2
in SGML such that the structure is available to the SGML parser, you'd
have to write:
<url><scheme>http</scheme><host>host.com</host>
<port>4000</port><path>dir1</path><path>dir2</path>
<path>file1</path><search>word1 word2</search></url>
Using attributes and tag minimization, you might reduce that to:
<url scheme="http"><host port="4000">host.com
<path>dir1<path>dir2<path>file1
<search>word1 word2</url>
and if you supported SHORTTAG, you could use short references and
other tricks to make it more reasonable. But the parser expands all
that minimization before making it available to the application. So
the net effect is that you're dealing with verbose character
representations of structure.
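
A rough sketch of that expansion, in Python: split the URL into its
parts and write out the fully tagged form. The element names follow the
example above; the function name `url_to_markup`, the treatment of `+`
as a word separator in the search part, and the escaping of `&`, `<` and
`>` are assumptions for illustration only.

```python
from urllib.parse import urlsplit
from xml.sax.saxutils import escape

def url_to_markup(url):
    """Render a URL as fully tagged markup, one element per component."""
    parts = urlsplit(url)
    pieces = [
        f"<scheme>{escape(parts.scheme)}</scheme>",
        f"<host>{escape(parts.hostname or '')}</host>",
    ]
    if parts.port is not None:
        pieces.append(f"<port>{parts.port}</port>")
    # One <path> element per non-empty path segment.
    for segment in parts.path.split("/"):
        if segment:
            pieces.append(f"<path>{escape(segment)}</path>")
    if parts.query:
        # Assumption: '+' separates search words, as in the example URL.
        words = parts.query.replace("+", " ")
        pieces.append(f"<search>{escape(words)}</search>")
    return "<url>" + "".join(pieces) + "</url>"

url = "http://host.com:4000/dir1/dir2/file1?word1+word2"
markup = url_to_markup(url)
print(len(url), len(markup))  # compare the two representations by size
```

Comparing the two lengths makes the point about verbosity concrete: the
tagged form is several times the size of the URL it describes.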
SGML is designed for situations where most of what you're dealing with
is text -- there's just a little markup here and there. So it's ok to
make the markup verbose in order to make encoding text convenient. But with
highly structured collections of small objects, it's a pain.
>Tim again:
>>Maybe Dave Crocker and Dan Connolly should get together and
>>rigorously define a "cleaned up" version of SGML, with the
>>phase-of-the-moon white space handling Dan refers to in a
I no longer see the value in this. The real value of using SGML is
interoperability with other conforming SGML implementations. I suppose
it makes sense to say "in the IIA, we won't use marked sections or
short references" so that IIA lexical analyzers can be simple. This
way, any IIA SGML document can be used by a conforming SGML
implementation.
But on the other hand, an author may be surprised when his/her
document acts weirdly in an IIA application after it has been
validated by a conforming SGML parser. And we lose much of the benefit
of using SGML at all.
>I think we should do this anyway. I have been ripped off to the tune of 50quid
>for the SGML manual. I find it marginally less readable than the MVS debugger
>manual.
Apparently, nobody is expected to actually read the SGML standard.
Folks are supposed to read the handbook by Charles Goldfarb. I'm not
sure what the purpose of the standard is, exactly: it's not
sufficiently formal to specify the language without further
explanation, and yet it doesn't provide that very explanation.
>If the SGML crowd want to claim any
>`validation' ability I would first want to see at least a rigorous semantics
>for SGML.
Agreed. The only thing that ISO 8879 specifies is how to reduce an
arbitrary stream of characters to one bit: 1 if it's "valid" and 0 if
it's not valid. It hints at and alludes to characteristics of "valid"
documents (they have newlines that get "ignored" in some cases, for
example.)
> The correspondence between a DTD and the abstract data structure
>it encapsulates should be made plain as should the resulting correspondence
>defined between the text of the document and its expression in the abstract data
>structure.
Apparently, though, complaints like these have not gone unnoticed. I
think there is a specification of the ESIS -- the information that an
application can get from a valid SGML document -- in a later standard.
There's also something called RAST -- Reference Application for SGML
Testing, which is essentially the reduction of an SGML document to a
standardized representation of the ESIS. The sgmls package by J. Clark
supports RAST now.
>
>I can't use SGML at present within a formal system. It is by no means clear to
>me that this is possible. Experience shows that a language designed informally
>is unlikely to be tractable as a formal system. This was the experience of the
>ADA team who ten years on have yet to produce a definitive semantics.
Agreed. Seconded. Amen. Though there is apparently a body of research
regarding formal properties of SGML. It seems pretty bass-ackwards to
me. Had they discovered the formal properties of the language before
writing the standard, I expect an SGML parser would be a one
person-month project, rather than a hundred person-month project as it
is.
Enough SGML bashing... back to formalisms for IIA -- is ASN.1 a Good
Thing? I have some experience with it, but I have never completely
grokked it. I like the idea of defining the structure of an object
independent of the representation. I like the idea of defining an
object in terms of operations on the object even better. I guess the
strategy I'd like to see is:
1. Define the operations on the object in question (URLs, URNs,
URCs, links, HTML documents, IIA resources)
2. Define the information necessary to support those objects
abstractly, using set theory or ASN or some such.
3. Define one or more octet-stream representations of that
data for storage and transmission.
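
The three steps might look something like the following Python sketch.
Everything in it is hypothetical: the `Locator` type, its fields, the
two operations, and both representations (a URL-like text form and a
length-prefixed binary form with 16-bit big-endian lengths) are
assumptions chosen to show the layering, not a proposal.

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class Locator:
    # Step 2: the abstract information, independent of any encoding.
    scheme: str
    host: str
    port: int
    path: tuple

    # Step 1: operations defined on the object itself.
    def parent(self):
        """Return the locator one path level up."""
        return Locator(self.scheme, self.host, self.port, self.path[:-1])

    def same_server(self, other):
        """True if both locators name the same scheme, host and port."""
        return ((self.scheme, self.host, self.port)
                == (other.scheme, other.host, other.port))

# Step 3a: a character representation.
def to_text(loc):
    return f"{loc.scheme}://{loc.host}:{loc.port}/" + "/".join(loc.path)

# Step 3b: an octet representation built from length-prefixed fields.
def _pack_str(value):
    data = value.encode("utf-8")
    return struct.pack("!H", len(data)) + data

def _unpack_str(buf, pos):
    (length,) = struct.unpack_from("!H", buf, pos)
    start = pos + 2
    return buf[start:start + length].decode("utf-8"), start + length

def to_octets(loc):
    out = _pack_str(loc.scheme) + _pack_str(loc.host)
    out += struct.pack("!HH", loc.port, len(loc.path))
    for segment in loc.path:
        out += _pack_str(segment)
    return out

def from_octets(buf):
    scheme, pos = _unpack_str(buf, 0)
    host, pos = _unpack_str(buf, pos)
    port, count = struct.unpack_from("!HH", buf, pos)
    pos += 4
    segments = []
    for _ in range(count):
        segment, pos = _unpack_str(buf, pos)
        segments.append(segment)
    return Locator(scheme, host, port, tuple(segments))

loc = Locator("http", "host.com", 4000, ("dir1", "dir2", "file1"))
assert from_octets(to_octets(loc)) == loc  # the encoding round-trips
```

The operations never touch either encoding, so a representation can be
added or replaced without changing what the object means.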
Dan
p.s. Anybody know where I can get a set of ASN.1 tools to play with via
FTP? I'd like to do some homework on this beasty.