Message-Id: <9309201801.AA07129@interval.interval.com>
Date: Mon, 20 Sep 1993 11:01:37 -0800
To: timbl@nxoc01.cern.ch, mitra@path.net (Mitra)
From: winograd@interval.com (Terry Winograd)
Subject: Re: URN single or multiple variants
>Date: Mon, 20 Sep 93 17:29:15 +0200
>From: Tim Berners-Lee <timbl@www3.cern.ch>
>Subject: Re: URN single or multiple variants
>
>I agree in general with Mitra (that the same URN should
>refer to any variants, machine conversion, etc) when it
>comes to actual usage today with _documents_.
The problem with the notion of "variants" is that it is a slippery slope
and it will be very hard to make a usable definition of what it means for
two things (even documents) to be the "same." If we start from the most
stringent notion (sameness of byte sequences), we can identify many
different kinds of relationships between two "different" objects:
1. Reversible mechanical translation (e.g., conversion to a file system
with different EOL conventions, compression, encryption, etc.)
2. Lossy mechanical translation preserving some aspects of content
    TeX source -> PostScript (preserves appearance but not structure)
    MSWord -> RTF (preserves most but not all structure)
    Image -> compressed image (lossy with respect to the full image)
    Image -> enhanced image (may throw away original information as well)
    Formatted document -> ASCII (preserves text content, not appearance)
    Language variants (e.g., changing from Mac character set to
      some ISO set; preserves as much as possible, but not all)
    Source code -> Object (preserves program semantics, but not full
      content; this is analogous to TeX source -> PostScript)
3. Non-mechanical translations (at least for now, mostly)
    ASCII -> Formatted document (may add meaning not intended in the
      original)
    Audio -> Transcript (preserves words (mostly) but little else)
    Text -> Spoken rendering (preserves words, adds new content)
4. Variants
    Multiple "versions" of a document (words differ but it is the "same
      document")
        drafts leading to a fixed document
        ongoing sequence of updated versions with no fixed point
          except for "current version"
        corrections to an already issued document
        subsequent "editions"
    Variant versions
        natural language translations
        audience dependence (e.g., "unix users" vs. "DOS users" version)
        different "official version" (IETF vs. ISO version....)
    and so on.
Rather than try to account for all of these, or arbitrarily legislate among
them, it seems that it will be more extensible to have a notion of
"identity with respect to..." and have URNs relative to this. For example,
there could be a URN for "the text content of the declaration of
independence" and another one for "The US Govt. standard SGML declaration
of independence" and yet another for "Andy Warhol's postscript rendering
of..." There would be formats for describing the relations between these
so that for operations requiring a particular kind of invariance, a person
(or program) could find the appropriate information.
This requires that each "information object" be given a characterization by
a "responsible party" as to what variance it is intended to cover. That
is, each of the above three examples, in addition to having a unique
identifying string, would have auxiliary information saying what constitutes
its "unique identity".
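To make the idea concrete, here is a rough sketch in Python of what such
records and relations might look like. Everything in it is an assumption
made for illustration: the "urn:example:..." strings, the relation names
("markup-of", "rendering-of"), the responsible-party names, and the small
vocabulary of invariants are all hypothetical, not a proposed syntax.

```python
from dataclasses import dataclass, field
from enum import Enum


class Invariant(Enum):
    """What stays the same across all instances of one URN (hypothetical vocabulary)."""
    BYTES = "byte sequence"
    TEXT = "text content"
    STRUCTURE = "markup structure"
    APPEARANCE = "rendered appearance"


@dataclass(frozen=True)
class UrnRecord:
    urn: str                # made-up identifier string
    description: str
    identity: Invariant     # declared by the responsible party
    responsible_party: str


@dataclass
class Registry:
    records: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (source urn, relation, target urn)

    def add(self, record):
        self.records[record.urn] = record

    def relate(self, source, relation, target):
        self.relations.append((source, relation, target))

    def find(self, start, wanted):
        """URNs directly linked to `start` (or `start` itself) declaring identity `wanted`."""
        linked = {start}
        for source, _, target in self.relations:
            if source == start:
                linked.add(target)
            if target == start:
                linked.add(source)
        return sorted(u for u in linked if self.records[u].identity is wanted)


# The three examples from the text, with invented identifiers.
reg = Registry()
reg.add(UrnRecord("urn:example:declaration-text",
                  "the text content of the declaration of independence",
                  Invariant.TEXT, "example-archive"))
reg.add(UrnRecord("urn:example:declaration-sgml",
                  "The US Govt. standard SGML declaration of independence",
                  Invariant.STRUCTURE, "example-govt"))
reg.add(UrnRecord("urn:example:declaration-warhol-ps",
                  "Andy Warhol's PostScript rendering",
                  Invariant.APPEARANCE, "example-artist"))
reg.relate("urn:example:declaration-sgml", "markup-of",
           "urn:example:declaration-text")
reg.relate("urn:example:declaration-warhol-ps", "rendering-of",
           "urn:example:declaration-text")
```

A program that needs appearance preserved would ask
reg.find("urn:example:declaration-text", Invariant.APPEARANCE) and get back
only the PostScript rendering; one that needs just the words would ask for
Invariant.TEXT instead.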
>From: mitra@path.net (Mitra)
> The question was whether:
>
>a) we assign a single URN to this (and pass it around with
>attribute information to distinguish the variant to choose) or
>
>b) assign a URN to each of the variants, with the attribute
>information telling us about the document we have.
I am saying that it is up to the creator of an information object for which
there will be a URN to specify just what constitutes a "variant" (meaning
an instance of the same URN), with a vocabulary of choices to select from.
I might create a URN for a paper I am writing and have all of the versions
be variants, and I might in addition create a URN for today's version, with
an appropriate link saying it is a version of the paper. Future versions
would have new URNs of the second kind but still be linked to the same one
of the first kind. Choice as to when to do this would be up to the
creator, not legislated by the protocol.
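As a rough sketch of the two kinds of URN for a paper (in Python, with
invented names throughout: the "urn:example:my-paper" string, the
"version-of" link name, the way a version URN is derived from a label, and
the second date are all assumptions for illustration only):

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    """First-kind URN: the creator has declared that all versions are variants of it."""
    urn: str
    versions: list = field(default_factory=list)  # second-kind URNs, oldest first

    def issue_version(self, label):
        """Mint a second-kind URN for one fixed version and link it back to the work."""
        version_urn = f"{self.urn}:{label}"  # naming scheme is hypothetical
        self.versions.append(version_urn)
        return version_urn, ("version-of", self.urn)

    def current(self):
        """Most recently issued version URN, or None if none has been issued."""
        return self.versions[-1] if self.versions else None


paper = Work("urn:example:my-paper")
today, today_link = paper.issue_version("1993-09-20")
later, later_link = paper.issue_version("1993-10-01")  # hypothetical later version
```

Both version URNs carry the same ("version-of", "urn:example:my-paper") link,
and nothing forces the creator to call issue_version at all; a creator who
never does so simply has one URN covering every draft.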
This makes the basic structure more complicated but in the end should
provide principled grounds for dealing with the huge tangle of real-world
complications that we are looking at. It can be extended in a natural way
to include relationships of derived information (e.g., summaries, indices,
etc.) and of annotation (responses, comments, glosses, edited versions (in
the sense of Shakespeare books), etc.).
--t
--------------------------------------------
Terry Winograd, Professor of Computer Science, Stanford University
1993 address:
Interval Research winograd@interval.com
1801 Page Mill Road 415/354-0854
Palo Alto, CA 94304 Fax: 415/354-0872
Long-term address:
Stanford University winograd@cs.stanford.edu
Computer Science Dept. 415/723-2780
Stanford, CA 94305-2140 Fax: 415/724-7411