PMP Mail Archive: Re: PMP> URGENT: SYNTHESIS... [ASCII versus UTF-8]

Ira Mcdonald x10962 (imcdonal@eso.mc.xerox.com)
Thu, 24 Jul 1997 09:20:33 PDT

Hi Michael,

What you wouldn't have gotten from reading RFC 2130 is that
UTF-8 is PART of ISO 10646, and is ONLY the encoding of a
particular form of UCS-2 Unicode. That is, while it is imprecise,
to say that 'UTF-8' is the CES is always ALSO to say that 2-byte
Unicode (UCS-2) is the CCS.

Otherwise, your point would be well worth considering.
There is ONLY one set of glyphs associated with the code
points in Unicode UCS-2 (part of ISO 10646), so there is
NO ambiguity. In addition, the ASCII seven-bit glyphs
are ALWAYS part of the set, for all languages and
countries - that's the beauty of Unicode, and that's
why ALL new protocol standards shall use UTF-8 as their
base CES (and UCS-2 as their base CCS).
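
As a rough illustration of the point above (a Python sketch, not part of
the original exchange): once the CES is known to be UTF-8, decoding the
octets yields ISO 10646 code points, and each code point names exactly one
character. The helper name `describe` and the sample octets are assumptions
chosen only for illustration.

```python
import unicodedata

def describe(octets: bytes) -> list:
    """Decode UTF-8 octets and name each resulting code point."""
    text = octets.decode("utf-8")  # CES step: octets -> code points
    # CCS step: every code point has exactly one assigned character name.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The same decoder handles ASCII, katakana, and Cyrillic with no extra tag.
print(describe(b"A"))             # [('U+0041', 'LATIN CAPITAL LETTER A')]
print(describe(b"\xe3\x82\xab"))  # [('U+30AB', 'KATAKANA LETTER KA')]
print(describe(b"\xd0\xaf"))      # [('U+042F', 'CYRILLIC CAPITAL LETTER YA')]
```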

Cheers,
- Ira McDonald (outside consultant at Xerox)

PS - The 'shall' in the last sentence applies to new IETF
standards only and is part of the recommendations section
of RFC 2130 (IAB Character Set Workshop report).

------------------------------ Michael's note -------------------
Date: Wed, 23 Jul 1997 17:27:48 PDT
From: Michael Kirkham <mikek@iwl.com>
To: JK Martin <jkm@underscore.com>
Cc: pmp@pwg.org
Subject: Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to allow superset of ASCII
In-Reply-To: <199707232336.TAA09862@uscore.underscore.com>
Message-Id: <Pine.SUN.3.93.970723164251.6167E-100000@iwl.iwl.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: pmp-owner@pwg.org
Status: R

Pardon me for interjecting -- in general I'm only advising Chris, since
I'm not officially part of the working group -- but I think there's some
misunderstanding between various working group members about just what
UTF-8 is and what benefit or drawback there is in switching from ASCII to
UTF-8. Here is my interpretation (and I'm by no means a character set
or localization expert; feel free to correct me if I'm wrong).

From what I have read of RFC 2130, "The Report of the IAB Character Set
Workshop," and from skimming various other documents on UTF-8, it looks
to me, on the surface, as though ASCII and UTF-8 are entirely different
beasts, and it is not simply a matter of switching from one to the other.

The way ASCII and UTF-8 differ is really in what they are, and what they
are trying to accomplish. ASCII is a "Coded Character Set," whereas UTF-8
is a "Character Encoding Scheme." RFC 2130 defines these to be:

3.2.1: Coded Character Set

A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
[ISO-8859].

3.2.2: Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a Coded
Character Set or several coded character sets to a set of octets.
Examples of Character Encoding Schemes are ISO 2022 [ISO-2022] and
UTF-8 [UTF-8]. A given CES is typically associated with a single
CCS; for example, UTF-8 applies only to ISO 10646.

To paraphrase, a Coded Character Set, such as ASCII, is the association of
a particular glyph (a graphical, visual representation such as the letter
A, or an asterisk) with a numerical value. (Example, in the case of
ASCII, the glyph of the letter "A" has a value of 65).

A Character Encoding Scheme, such as UTF-8, is on the other hand a way
of encoding different Coded Character Sets (such as ASCII, or katakana, or
Cyrillic) into a stream of octets, such that you can algorithmically
determine that a set of N octets is a single glyph, the next set of M
octets is another glyph, and so on -- but not determine just what glyph
that is.
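
The two layers, and the boundary-finding step described above, can be
sketched in Python roughly as follows. This is a minimal sketch under stated
assumptions: the function names are hypothetical, sequences longer than four
octets are not handled (RFC 2044 allows up to six), and continuation octets
are not validated.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the octet count of a UTF-8 sequence, judged from its first octet."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: a single octet, the ASCII range
    if lead >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError(f"not a supported UTF-8 lead octet: {lead:#04x}")

def split_sequences(octets: bytes) -> list:
    """Cut an octet stream into per-character chunks without decoding them."""
    chunks = []
    i = 0
    while i < len(octets):
        n = utf8_sequence_length(octets[i])
        chunks.append(octets[i:i + n])
        i += n
    return chunks

# CCS layer: the abstract character "A" maps to the integer 65.
print(ord("A"))
# CES layer: one octet for "A", then three octets for a katakana character.
print(split_sequences(b"A\xe3\x82\xab"))
```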

In other words, given a stream of octets that is encoded in a Character
Encoding Scheme such as UTF-8, there is no way to know intrinsically what
glyphs to display. A value encoded in UTF-8 is therefore meaningless
without knowing what Coded Character Set (such as ASCII) it was mapped
from -- unless you know, by virtue of Printer-Y being sold in Japan, that
the Coded Character Set is katakana (but the Character Encoding Scheme
used to store that value in the MIB object is still UTF-8).

The association between ASCII and UTF-8 is what RFC 2130 simply refers to
as an "Exact Conversion" by nature of the ASCII Coded Character Set being
identical to ASCII-encoded-in-UTF-8 (RFC 2130 section 3.5 - 3.5.1).
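
The "Exact Conversion" property can be checked with a short Python sketch.
The helper names `same_octets` and `is_ascii_only` and the sample strings
are assumptions made for illustration only.

```python
def same_octets(text: str) -> bool:
    """True if the ASCII and UTF-8 encodings of an ASCII-only string agree."""
    return text.encode("ascii") == text.encode("utf-8")

def is_ascii_only(octets: bytes) -> bool:
    """True if every octet is in the 7-bit range, so no multi-octet sequences occur."""
    return all(b < 0x80 for b in octets)

print(same_octets("Tray 1"))                     # ASCII text is unchanged in UTF-8
print(is_ascii_only("Tray 1".encode("utf-8")))   # all octets below 0x80
print(is_ascii_only("\u30ab".encode("utf-8")))   # katakana needs octets >= 0x80
```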

What I'm getting at is that it doesn't seem to me that you can actually
gain anything on the localization issue by using UTF-8. Yes you can now
write katakana into a MIB object -- but you can't get katakana out, unless
something somewhere says that the value you are reading is katakana, or
unless you just *know*. If you don't simply know, it falls into what RFC
2130 calls "Determined by guessing" (section 3.3), and which it
discourages:

We recommend that each protocol clearly specify what it is using for
each of the layers of the transmission model. Users (or clients)
should never have to guess what the parameter is for a given layer.
(RFC 2130 section 3.4.1)

One problem that arises, and has been brought up, is how to handle the
case where Management Station A sets the value of a MIB Object and
Management Station B has to read the value and know what it means. This
is a problem in both (a) relaxing a DisplayString object to Octet String
and allowing any Coded Character Set -and- (b) encoding that Coded
Character Set in UTF-8, for the reasons outlined above.

This problem may not be addressable with the idea of having a single MIB
object specifying what Coded Character Set a group of objects is using --
a management station can still encode Cyrillic in UTF-8 and then set a MIB
object, even though some MIB object says that the MIB is using ISO-10646.

RFC 2130 also goes on to make recommendations about how you determine the
Coded Character Set of a given bit of text, such as MIME headers, etc. A
solution along these lines is beyond the scope of the Printer MIB. What
SNMP really needs in this area is its own version of a MIME header in
both the SetRequest-PDU and the GetResponse-PDU, so that a management
station can issue a Set Request saying, "This is ISO-10646 encoded in
UTF-8," and later another management station can read the value with the
Get Response saying "This UTF-8 value should be decoded to ISO-10646."

This is, I believe, where future versions of SNMP are probably heading.

As I said, if I've misinterpreted the relationship between ASCII/Coded
Character Sets and UTF-8/Character Encoding Schemes, please let me know.
From reading this discussion, I suspect nobody has an entirely clear
picture, and I fully admit I'm not entirely clear myself. This message
isn't directed at anyone in particular -- only trying to make sure
everyone sees the same picture.

On Wed, 23 Jul 1997, JK Martin wrote:

> I think Dave Kellerman's new proposal may very well be the best
> way to solve this last-minute crisis in a decent way.
>
> As I understand Dave's proposal, instead of having a fixed charset
> of ASCII, we simply move it to UTF-8, and thereby gain a reasonable
> amount of localization with absolutely minimal impact to the MIB.
>
> Also in support of Dave's proposal--and this means a lot to us as
> mgmt app developers--we stand a much better chance of having decent
> interoperability if we constrain the charset to UTF-8.
>
> I hope the PMP group accepts this new proposal for UTF-8, assuming
> of course no one pokes a big fat hole in it. ;-)
>
> ...jay
>
> ----- Begin Included Message -----
>
> From pmp-owner@pwg.org Wed Jul 23 18:50 EDT 1997
> Date: Wed, 23 Jul 1997 15:46:51 PST
> From: David_Kellerman@nls.com
> To: pmp@pwg.org
> Subject: Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to
> allow superset of ASCII
>
> If there really is a broad interest in "fixing" the localization
> problem, I would suggest an alternative to Tom's proposal -- switch from
> ASCII to UTF-8 for OCTET STRING objects where representation of
> multilingual text is appropriate.
>
> Summary of arguments in favor: no new objects, consistent with existing
> conforming implementations (ASCII is subset of UTF-8), doesn't introduce
> the complexity of multiple character sets for affected objects, doesn't
> introduce the complexity of changeable character sets for affected
> objects, seems to be consistent with direction of IETF generally and
> SNMP in particular.
>
> Problems I see are, briefly: forces implementations to deal with UTF-8,
> and it conflicts with existing implementations that allow non-ASCII
> characters in the strings. How serious these are depends, in part, on
> whether you believe other MIB work is going to force UTF-8 anyway, and
> how much weight you want to give to existing practice that deviates from
> the existing standard.
>
> Supporting material:
> 1. See the note from Randy Presuhn that Chris forwarded to the mailing
> list. He suggests this approach, has obviously given the topic a
> lot of thought, and discusses it in some detail. He also asserts
> that the SNMPv3 effort is headed toward use of UTF-8 for all
> human-readable strings.
> 2. I read Harald Alvestrand's message differently than Tom. I think it
> says to specify the character set (a single one) and recommends
> UTF-8; not to allow multiple character sets, chosen at the
> discretion of the agent or application.
> 3. I also read RFC 2130 (The Character Set Workshop Report) differently
> than Tom. It covers a lot of ground, trying to address migration of
> existing protocols as well as new work. For new protocols in
> particular, it says in part:
> New protocols do not suffer from the need to be compatible with
> old 7-bit pipes. New protocol specifications SHOULD use ISO
> 10646 as the base charset unless there is an overriding need to
> use a different base character set.
>
> Here are the details of the changes to the document:
>
> 1. Copy the Utf8String TC from the sysAppl draft:
>
>    Utf8String ::= TEXTUAL-CONVENTION
>        DISPLAY-HINT "255a"
>        STATUS  current
>        DESCRIPTION
>            "To facilitate internationalization, this TC
>            represents information taken from the ISO/IEC IS
>            10646-1 character set, encoded as an octet string
>            using the UTF-8 character encoding scheme described
>            in RFC 2044 [**]. For strings in 7-bit US-ASCII,
>            there is no impact since the UTF-8 representation
>            is identical to the US-ASCII encoding."
>        SYNTAX  OCTET STRING (SIZE (0..255))
>
> Stylistically, you might want to introduce a ShortUtf8String with
> SIZE (0..63) -- it would simplify many of the SYNTAX clauses (see
> below).
>
> 2. Change the SYNTAX for the following objects from OCTET STRING:
>
>    prtGeneralCurrentOperator   Utf8String (SIZE(0..127))
>    prtGeneralServicePerson     Utf8String (SIZE(0..127))
>    prtGeneralSerialNumber      Utf8String
>    prtGeneralPrinterName       Utf8String
>
>    prtInputMediaName           Utf8String (SIZE(0..63))
>    prtInputName                Utf8String (SIZE(0..63))
>    prtInputVendorName          Utf8String (SIZE(0..63))
>    prtInputModel               Utf8String (SIZE(0..63))
>    prtInputVersion             Utf8String (SIZE(0..63))
>    prtInputSerialNumber        Utf8String (SIZE(0..32))
>
>    prtInputMediaType           Utf8String (SIZE(0..63))
>    prtInputMediaColor          Utf8String (SIZE(0..63))
>
>    prtOutputName               Utf8String (SIZE(0..63))
>    prtOutputVendorName         Utf8String (SIZE(0..63))
>    prtOutputModel              Utf8String (SIZE(0..63))
>    prtOutputVersion            Utf8String (SIZE(0..63))
>    prtOutputSerialNumber       Utf8String (SIZE(0..63))
>
>    prtMarkerColorantValue      Utf8String
>
>    prtChannelProtocolVersion   Utf8String (SIZE(0..63))
>
>    prtInterpreterLangLevel     Utf8String (SIZE(0..31))
>    prtInterpreterLangVersion   Utf8String (SIZE(0..31))
>    prtInterpreterVersion       Utf8String (SIZE(0..31))
>
> 3. Add the reference to RFC 2044 to the bibliography:
>
> [**] F. Yergeau, "UTF-8, a transformation format of Unicode
> and ISO 10646", RFC 2044, October 1996.
>
> That's it.
>
> :: David Kellerman Northlake Software 503-228-3383
> :: david_kellerman@nls.com Portland, Oregon fax 503-228-5662
>
>
> ----- End Included Message -----
>
>
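
One practical consequence of the Utf8String definition quoted above is that
its SIZE limit counts octets, not characters. A minimal Python sketch of what
a management application might check before a Set follows; the helper names
`fits_utf8string` and `truncate_to_octets` are hypothetical, and the default
limit of 255 is taken from the quoted SYNTAX clause.

```python
def fits_utf8string(text: str, max_octets: int = 255) -> bool:
    """True if the UTF-8 encoding of text fits within the SIZE limit."""
    return len(text.encode("utf-8")) <= max_octets

def truncate_to_octets(text: str, max_octets: int = 255) -> str:
    """Shorten text so its UTF-8 form fits, never splitting a character."""
    cut = text.encode("utf-8")[:max_octets]
    # errors="ignore" drops an incomplete multi-octet sequence left at the end.
    return cut.decode("utf-8", errors="ignore")

name = "\u30ab" * 100                     # 100 katakana characters, 300 octets
print(fits_utf8string("A" * 100))         # 100 ASCII characters use 100 octets
print(fits_utf8string(name))              # 300 octets exceed the 255 limit
print(len(truncate_to_octets(name)))      # characters that fit in 255 octets
```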

---------------------------------------------------------------------------
--==--==--==- Michael Kirkham Software Engineer
==--==--==--= Email: mikek@iwl.com Web: http://www.iwl.com/
--==--==--==- InterWorking Labs, Inc. 244 Santa Cruz Ave, Aptos, CA 95003
==--==--==--= Tel: +1 408 685 3190 Fax: +1 408 662 9065
---------------------------------------------------------------------------