Per the discussion of URL syntax at today's IPP Telecon, below are
excerpts from Generic URI Syntax, RFC 2396 (updates RFC 1738 and 1808)
and Extended URIs (draft-masinter-url-i18n-02.txt).
- Ira McDonald (outside consultant at Xerox)
High North Inc
Network Working Group T. Berners-Lee
Request for Comments: 2396 MIT/LCS
Updates: 1808, 1738 R. Fielding
Category: Standards Track U.C. Irvine
Uniform Resource Identifiers (URI): Generic Syntax
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright (C) The Internet Society (1998). All Rights Reserved.
This paper describes a "superset" of operations that can be applied
to URI. It consists of both a grammar and a description of basic
functionality for URI. To understand what is a valid URI, both the
grammar and the associated description have to be studied. Some of
the functionality described is not applicable to all URI schemes, and
some operations are only possible when certain media types are
retrieved using the URI, regardless of the scheme used.
A Uniform Resource Identifier (URI) is a compact string of characters
for identifying an abstract or physical resource. This document
defines the generic syntax of URI, including both absolute and
relative forms, and guidelines for their use; it revises and replaces
the generic definitions in RFC 1738 and RFC 1808.
This document defines a grammar that is a superset of all valid URI,
such that an implementation can parse the common components of a URI
reference without knowing the scheme-specific requirements of every
possible identifier type. This document does not define a generative
grammar for URI; that task will be performed by the individual
specifications of each URI scheme.
Uniform Resource Identifiers (URI) provide a simple and extensible
means for identifying a resource. This specification of URI syntax
and semantics is derived from concepts introduced by the World Wide
Web global information initiative, whose use of such objects dates
from 1990 and is described in "Universal Resource Identifiers in WWW"
[RFC1630]. The specification of URI is designed to meet the
recommendations laid out in "Functional Recommendations for Internet
Resource Locators" [RFC1736] and "Functional Requirements for Uniform
Resource Names" [RFC1737].
This document updates and merges "Uniform Resource Locators"
[RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order
to define a single, generic syntax for all URI. It excludes those
portions of RFC 1738 that defined the specific syntax of individual
URL schemes; those portions will be updated as separate documents, as
will the process for registration of new URI schemes. This document
does not discuss the issues and recommendation for dealing with
characters outside of the US-ASCII character set [ASCII]; those
recommendations are discussed in a separate document.
All significant changes from the prior RFCs are noted in Appendix G.
1.1 Overview of URI
different contexts, thus permitting new applications or
protocols to leverage a pre-existing, large, and widely-used
set of resource identifiers.
A resource can be anything that has identity. Familiar
examples include an electronic document, an image, a service
(e.g., "today's weather report for Los Angeles"), and a
collection of other resources. Not all resources are network
"retrievable"; e.g., human beings, corporations, and bound
books in a library can also be considered resources.
The resource is the conceptual mapping to an entity or set of
entities, not necessarily the entity which corresponds to that
mapping at any particular instance in time. Thus, a resource
can remain constant even when its content---the entities to
which it currently corresponds---changes over time, provided
that the conceptual mapping is not changed in the process.
An identifier is an object that can act as a reference to
something that has identity. In the case of URI, the object is
a sequence of characters with a restricted syntax.
Having identified a resource, a system may perform a variety of
operations on the resource, as might be characterized by such words
as `access', `update', `replace', or `find attributes'.
1.2. URI, URL, and URN
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URI
that identify resources via a representation of their primary access
mechanism (e.g., their network "location"), rather than identifying
the resource by name or by some other attribute(s) of that resource.
The term "Uniform Resource Name" (URN) refers to the subset of URI
that are required to remain globally unique and persistent even when
the resource ceases to exist or becomes unavailable.
The URI scheme (Section 3.1) defines the namespace of the URI, and
thus may further restrict the syntax and semantics of identifiers
using that scheme. This specification defines those elements of the
URI syntax that are either required of all URI schemes or are common
to many URI schemes. It thus defines the syntax and semantics that
are needed to implement a scheme-independent parsing mechanism for
URI references, such that the scheme-dependent handling of a URI can
be postponed until the scheme-dependent semantics are needed. We use
the term URL below when describing syntax or semantics that only
apply to locators.
Although many URL schemes are named after protocols, this does not
imply that the only way to access the URL's resource is via the named
protocol. Gateways, proxies, caches, and name resolution services
might be used to access some resources, independent of the protocol
of their origin, and the resolution of some URL may require the use
of more than one protocol (e.g., both DNS and HTTP are typically used
to access an "http" URL's resource when it can't be found in a local
A URN differs from a URL in that it's primary purpose is persistent
labeling of a resource with an identifier. That identifier is drawn
from one of a set of defined namespaces, each of which has its own
set name structure and assignment procedures. The "urn" scheme has
been reserved to establish the requirements for a standardized URN
namespace, as defined in "URN Syntax" [RFC2141] and its related
Most of the examples in this specification demonstrate URL, since
they allow the most varied use of the syntax and often have a
hierarchical namespace. A parser of the URI syntax is capable of
parsing both URL and URN references as a generic URI; once the scheme
is determined, the scheme-specific parsing can be performed on the
generic URI components. In other words, the URI syntax is a superset
of the syntax of all URI schemes.
INTERNET-DRAFT Larry Masinter
draft-masinter-url-i18n-02 August 30, 1998
Expires in 6 months
Representing non-ASCII Characters in URIs and Extended URIs
URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission in
network protocols and representation in spoken and written human
This document defines a uniform way of representing non-ASCII scripts
in URIs and in an extended 8-bit form (8URI), so these identifiers can
be used for the world's languages. The document gives guidelines for
the use and deployment of these forms in various elements of software
that deal with URIs.
URIs [RFC 2396] are defined as sequences of characters chosen from a
limited subset of the repertoire of ASCII characters. The characters
in URIs are frequently used for representing English words and
phrases; unfortunately, this leaves out most of the world, who do not
write merely with the letters A-Z.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.
This document defines two ways of representing non-ASCII characters in
resource identifiers: a URI syntax which is compatible with the
definition of URI syntax [RFC 2396], and a new syntax which is usable
in contexts where resource identifiers are transported within "8-bit"
environments. This new syntax is called an "8URI"; it is upward
compatible with the URI syntax, but is defined as a sequence of 8-bit
2.1 URI syntax
The standard definition of URIs [RFC 2396] requires that URIs be
represented with a very limited repertoire of characters which are a
subset of those characters representable in ASCII. URIs are defined as
a sequence of characters (since URIs may be written on paper or read
out loud) which my be represented as a sequence of 7-bit bytes.
Character sequences that include non-ASCII characters must be
transcribed to represent them in URIs. The transcription to be applied
to a character sequence before it is included in an element of a URI
(path, etc.) SHOULD be performed by:
1) representing the characters as a sequence of ISO 10646 characters.
2) "normalizing" the character sequence to reduce ambiguity.
[UNI15] defines several normalization forms; for the purpose
of representing characters in URIs, "Normalization Form CC".
3) encoding the result with the UTF-8 character encoding [RFC 2279]
4) using %HH hex-encoding [RFC 2396] to encode any octet that
does not correspond to an allowed, non-reserved character.
This syntax is consistent with the definition of the generic URI
syntax [RFC 2396], the URN syntax [RFC 2141], as well as recent URL
scheme definitions [RFC 2192], [RFC 2384].
2.2 8URI syntax
This specification defines a new protocol element, called an '8URI'.
An 8URI is similar to a URI in its use, but is different in that it is
solely for use in network protocols that allow the transport of octets
outside of the range allowed within URIs. An 8URI MAY have 8-bit
octets within it. An 8URI is represented using the same methods (1-4)
defined in section 2.1, but in step (4), octets with the leading bit
on need not be encoded; all characters outside of those explicitly
disallowed in RFC 2396 (reserved, delimiters, white space, unwise
special characters) MAY be represented directly by their UTF-8
An '8URI' for characters outside of the ASCII range will use
considerably less space than the corresponding hex-encoded URI.
Even within 8URIs, any octet sequence which would likely yield
ambiguous or incorrect results when printed or displayed and then
subsequently typed by a user SHOULD be hex-encoded.
Internet protocols that currently allow the designation of a URI may
be extended at some point to allow 8URIs as well as URIs, but this
extension must be done explicitly. Section 4 lays out some of the
software guidelines that will allow the deployment of 8URIs in
existing Internet Protocols.