IPP> MOD - RFC 2396 (URIs) and new '8URI' draft

Wed Oct 7 20:34:43 EDT 1998

Hi folks,                                     Wednesday (7 October 1998)

Per the discussion of URL syntax at today's IPP Telecon, below are
excerpts from Generic URI Syntax, RFC 2396 (updates RFC 1738 and 1808)
and Extended URIs (draft-masinter-url-i18n-02.txt).

Cheers,
- Ira McDonald (outside consultant at Xerox)
  High North Inc
  906-494-2434

------------------------------------------------------------------------

Network Working Group                                     T. Berners-Lee
Request for Comments: 2396                                       MIT/LCS
Updates: 1808, 1738                                          R. Fielding
Category: Standards Track                                    U.C. Irvine
                                                             L. Masinter
                                                       Xerox Corporation
                                                             August 1998

           Uniform Resource Identifiers (URI): Generic Syntax

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

IESG Note

   This paper describes a "superset" of operations that can be applied
   to URI.  It consists of both a grammar and a description of basic
   functionality for URI.  To understand what is a valid URI, both the
   grammar and the associated description have to be studied.  Some of
   the functionality described is not applicable to all URI schemes, and
   some operations are only possible when certain media types are
   retrieved using the URI, regardless of the scheme used.

Abstract

   A Uniform Resource Identifier (URI) is a compact string of characters
   for identifying an abstract or physical resource.  This document
   defines the generic syntax of URI, including both absolute and
   relative forms, and guidelines for their use; it revises and replaces
   the generic definitions in RFC 1738 and RFC 1808.

   This document defines a grammar that is a superset of all valid URI,
   such that an implementation can parse the common components of a URI
   reference without knowing the scheme-specific requirements of every
   possible identifier type.  This document does not define a generative
   grammar for URI; that task will be performed by the individual
   specifications of each URI scheme.

1. Introduction

   Uniform Resource Identifiers (URI) provide a simple and extensible
   means for identifying a resource.  This specification of URI syntax
   and semantics is derived from concepts introduced by the World Wide
   Web global information initiative, whose use of such objects dates
   from 1990 and is described in "Universal Resource Identifiers in WWW"
   [RFC1630].  The specification of URI is designed to meet the
   recommendations laid out in "Functional Recommendations for Internet
   Resource Locators" [RFC1736] and "Functional Requirements for Uniform
   Resource Names" [RFC1737].

   This document updates and merges "Uniform Resource Locators"
   [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order
   to define a single, generic syntax for all URI.  It excludes those
   portions of RFC 1738 that defined the specific syntax of individual
   URL schemes; those portions will be updated as separate documents, as
   will the process for registration of new URI schemes.  This document
   does not discuss the issues and recommendations for dealing with
   characters outside of the US-ASCII character set [ASCII]; those
   recommendations are discussed in a separate document.

   All significant changes from the prior RFCs are noted in Appendix G.

1.1 Overview of URI

[snip...]

         different contexts, thus permitting new applications or
         protocols to leverage a pre-existing, large, and widely-used
         set of resource identifiers.

      Resource
         A resource can be anything that has identity.  Familiar
         examples include an electronic document, an image, a service
         (e.g., "today's weather report for Los Angeles"), and a
         collection of other resources.  Not all resources are network
         "retrievable"; e.g., human beings, corporations, and bound
         books in a library can also be considered resources.

         The resource is the conceptual mapping to an entity or set of
         entities, not necessarily the entity which corresponds to that
         mapping at any particular instance in time.  Thus, a resource
         can remain constant even when its content---the entities to
         which it currently corresponds---changes over time, provided
         that the conceptual mapping is not changed in the process.

      Identifier
         An identifier is an object that can act as a reference to
         something that has identity.  In the case of URI, the object is
         a sequence of characters with a restricted syntax.

   Having identified a resource, a system may perform a variety of
   operations on the resource, as might be characterized by such words
   as `access', `update', `replace', or `find attributes'.

1.2. URI, URL, and URN

   A URI can be further classified as a locator, a name, or both.  The
   term "Uniform Resource Locator" (URL) refers to the subset of URI
   that identify resources via a representation of their primary access
   mechanism (e.g., their network "location"), rather than identifying
   the resource by name or by some other attribute(s) of that resource.
   The term "Uniform Resource Name" (URN) refers to the subset of URI
   that are required to remain globally unique and persistent even when
   the resource ceases to exist or becomes unavailable.

   The URI scheme (Section 3.1) defines the namespace of the URI, and
   thus may further restrict the syntax and semantics of identifiers
   using that scheme.  This specification defines those elements of the
   URI syntax that are either required of all URI schemes or are common
   to many URI schemes.  It thus defines the syntax and semantics that
   are needed to implement a scheme-independent parsing mechanism for
   URI references, such that the scheme-dependent handling of a URI can
   be postponed until the scheme-dependent semantics are needed.  We use
   the term URL below when describing syntax or semantics that only
   apply to locators.

   Although many URL schemes are named after protocols, this does not
   imply that the only way to access the URL's resource is via the named
   protocol.  Gateways, proxies, caches, and name resolution services
   might be used to access some resources, independent of the protocol
   of their origin, and the resolution of some URL may require the use
   of more than one protocol (e.g., both DNS and HTTP are typically used
   to access an "http" URL's resource when it can't be found in a local
   cache).

   A URN differs from a URL in that its primary purpose is persistent
   labeling of a resource with an identifier.  That identifier is drawn
   from one of a set of defined namespaces, each of which has its own
   set name structure and assignment procedures.  The "urn" scheme has
   been reserved to establish the requirements for a standardized URN
   namespace, as defined in "URN Syntax" [RFC2141] and its related
   specifications.
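
   As a rough illustration of the shape of the "urn" scheme, the sketch
   below splits a URN into its namespace identifier (NID) and
   namespace-specific string (NSS), following the "urn:" NID ":" NSS
   layout of RFC 2141.  The function name split_urn, the simplified
   pattern (the NSS character set is not checked), and the sample ISBN
   value are assumptions made for this example only.

```python
import re

# Simplified form of the RFC 2141 layout: "urn:" <NID> ":" <NSS>.
# The NID is a letter or digit followed by up to 31 letters, digits, or
# hyphens.  Assumption: the NSS is accepted as any non-empty string here;
# a full validator would also restrict its character set.
URN_RE = re.compile(r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def split_urn(urn):
    """Return (nid, nss) for a URN, or None if the string is not a URN."""
    m = URN_RE.match(urn)
    if m is None:
        return None
    nid, nss = m.groups()
    # The "urn" prefix and the NID are case-insensitive; the NSS is kept as-is.
    return nid.lower(), nss


print(split_urn("urn:isbn:0451450523"))
print(split_urn("http://example.com/"))
```

   The first call yields the namespace and the identifier within it; the
   second shows that a locator such as an "http" URL is rejected by this
   URN-specific check.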

   Most of the examples in this specification demonstrate URL, since
   they allow the most varied use of the syntax and often have a
   hierarchical namespace.  A parser of the URI syntax is capable of
   parsing both URL and URN references as a generic URI; once the scheme
   is determined, the scheme-specific parsing can be performed on the
   generic URI components.  In other words, the URI syntax is a superset
   of the syntax of all URI schemes.
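
   The two-stage approach described above can be sketched as follows.
   The regular expression is the one given in Appendix B of RFC 2396;
   everything else (the names split_uri and scheme_specific, the choice
   of "http" and "mailto" as sample schemes, and the example.com
   addresses) is hypothetical and only meant to show where
   scheme-dependent handling would start.

```python
import re

# Regular expression given in Appendix B of RFC 2396 for breaking a URI
# reference into its generic components.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(ref):
    """Scheme-independent pass: return the five generic components.

    Absent components come back as None, except the path, which is an
    empty string when nothing is present.
    """
    m = URI_RE.match(ref)  # every sub-pattern is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def scheme_specific(parts):
    """Hypothetical second pass: interpret the generic parts per scheme."""
    scheme = (parts["scheme"] or "").lower()
    if scheme == "http":
        host, _, port = (parts["authority"] or "").partition(":")
        return {"host": host, "port": int(port) if port else 80}
    if scheme == "mailto":
        local, _, domain = parts["path"].partition("@")
        return {"local": local, "domain": domain}
    return {}  # unknown scheme: only the generic components are available


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts["scheme"], parts["authority"], parts["path"], parts["fragment"])
print(split_uri("urn:isbn:0451450523")["path"])
print(scheme_specific(split_uri("mailto:user@example.com")))
```

   Note that split_uri never needs to know the scheme: a URN is parsed
   by the same expression as a URL, and only scheme_specific consults
   the scheme name.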

------------------------------------------------------------------------

INTERNET-DRAFT                                         Larry Masinter
                                                    Xerox Corporation
                                                        Martin Duerst
                                                  W3C/Keio University
draft-masinter-url-i18n-02                            August 30, 1998
Expires in 6 months

   Representing non-ASCII Characters in URIs and Extended URIs

[snip...]

Abstract

URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission in
network protocols and representation in spoken and written human
communication.

This document defines a uniform way of representing non-ASCII scripts
in URIs and in an extended 8-bit form (8URI), so these identifiers can
be used for the world's languages. The document gives guidelines for
the use and deployment of these forms in various elements of software
that deal with URIs.

1. Introduction

URIs [RFC 2396] are defined as sequences of characters chosen from a
limited subset of the repertoire of ASCII characters.  The characters
in URIs are frequently used for representing English words and
phrases; unfortunately, this leaves out most of the world, who do not
write merely with the letters A-Z.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

2. Syntax

This document defines two ways of representing non-ASCII characters in
resource identifiers: a URI syntax which is compatible with the
definition of URI syntax [RFC 2396], and a new syntax which is usable
in contexts where resource identifiers are transported within "8-bit"
environments. This new syntax is called an "8URI"; it is upward
compatible with the URI syntax, but is defined as a sequence of 8-bit
octets.

2.1 URI syntax

The standard definition of URIs [RFC 2396] requires that URIs be
represented with a very limited repertoire of characters which are a
subset of those characters representable in ASCII. URIs are defined as
a sequence of characters (since URIs may be written on paper or read
out loud) which may be represented as a sequence of 7-bit bytes.

Character sequences that include non-ASCII characters must be
transcribed to represent them in URIs. The transcription to be applied
to a character sequence before it is included in an element of a URI
(path, etc.) SHOULD be performed by:

1) representing the characters as a sequence of ISO 10646 characters;
2) "normalizing" the character sequence to reduce ambiguity.
   [UNI15] defines several normalization forms; for the purpose
   of representing characters in URIs, "Normalization Form CC"
   is used;
3) encoding the result with the UTF-8 character encoding [RFC 2279];
4) using %HH hex-encoding [RFC 2396] to encode any octet that
   does not correspond to an allowed, non-reserved character.
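
The four steps above can be sketched in Python as shown below. This is
an illustration under stated assumptions, not text from the draft: the
function name to_uri_element is made up, Python's "NFC" form is used as
a stand-in for the draft's "Normalization Form CC" (whether the two are
identical is not established here), and every octet outside the
"unreserved" set of RFC 2396 is hex-encoded, which suits a single URI
element such as one path segment.

```python
import unicodedata

# The "unreserved" characters of RFC 2396: alphanumerics plus the marks
# - _ . ! ~ * ' ( )
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def to_uri_element(text):
    """Transcribe a character sequence for use inside one URI element."""
    # Step 1: a Python str is already a sequence of ISO 10646 characters.
    # Step 2: normalize (assumption: NFC stands in for the draft's form).
    normalized = unicodedata.normalize("NFC", text)
    # Step 3: encode the result as UTF-8 octets.
    octets = normalized.encode("utf-8")
    # Step 4: %HH-encode every octet that is not an unreserved character.
    return "".join(
        chr(b) if chr(b) in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )


print(to_uri_element("r\u00e9sum\u00e9"))    # precomposed e-acute
print(to_uri_element("re\u0301sume\u0301"))  # e followed by combining acute
```

Both calls print the same string, which is the point of step 2: two
spellings of the same text that differ only in composition end up as
one identifier.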

This syntax is consistent with the definition of the generic URI
syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
scheme definitions [RFC 2192], [RFC 2384].

2.2 8URI syntax

This specification defines a new protocol element, called an '8URI'.
An 8URI is similar to a URI in its use, but is different in that it is
solely for use in network protocols that allow the transport of octets
outside of the range allowed within URIs. An 8URI MAY have 8-bit
octets within it. An 8URI is represented using the same methods (1-4)
defined in section 2.1, but in step (4), octets with the leading bit
on need not be encoded; all characters outside of those explicitly
disallowed in RFC 2396 (reserved, delimiters, white space, unwise
special characters) MAY be represented directly by their UTF-8
encoding.

An '8URI' for characters outside of the ASCII range will use
considerably less space than the corresponding hex-encoded URI.
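
The size difference can be checked with a small sketch. As before, the
names are hypothetical, "NFC" is an assumed stand-in for the draft's
normalization form, and the three-character sample string (written with
\u escapes to keep this text in ASCII) is chosen only for illustration.
The eight_bit flag switches step (4) from the URI rule to the 8URI
rule, under which octets with the leading bit set are left as they are.

```python
import unicodedata

# The "unreserved" characters of RFC 2396: alphanumerics plus the marks
# - _ . ! ~ * ' ( )
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def encode_element(text, eight_bit=False):
    """Return the octets of one element in URI form or, if eight_bit, 8URI form."""
    octets = unicodedata.normalize("NFC", text).encode("utf-8")
    out = bytearray()
    for b in octets:
        if chr(b) in UNRESERVED or (eight_bit and b >= 0x80):
            out.append(b)  # kept as a raw octet
        else:
            out += "%{:02X}".format(b).encode("ascii")  # %HH hex-encoding
    return bytes(out)


sample = "\u65e5\u672c\u8a9e"  # three characters, three UTF-8 octets each
print(len(encode_element(sample)))                  # URI form
print(len(encode_element(sample, eight_bit=True)))  # 8URI form
```

Each non-ASCII octet costs three octets in the URI form ("%" plus two
hex digits) and one octet in the 8URI form, so text made up entirely of
such characters shrinks to a third of its hex-encoded size. ASCII
characters that are not allowed unencoded, such as a space, are still
hex-encoded in both forms.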

Even within 8URIs, any octet sequence which would likely yield
ambiguous or incorrect results when printed or displayed and then
subsequently typed by a user SHOULD be hex-encoded.

Internet protocols that currently allow the designation of a URI may
be extended at some point to allow 8URIs as well as URIs, but this
extension must be done explicitly. Section 4 lays out some of the
software guidelines that will allow the deployment of 8URIs in
existing Internet Protocols.

------------------------------------------------------------------------