PMP Mail Archive: Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to

Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to

David_Kellerman@nls.com
Wed, 23 Jul 1997 18:43:21 PST


Michael,

One moment you're walking on solid ground; the next thing you know,
you're up to your armpits in the character set tar pit.

You've done a nice job of drawing together a lot of threads of
material, but the situation is not as problematic as you suggest.

The references you see to "UTF-8" are really shorthand for "ISO/IEC IS
10646-1 character set, encoded as an octet string using the UTF-8
character encoding scheme described in RFC 2044." (That's how it's
spelled out in the SysAppl MIB definition of the Utf8String TC, and you
can see why people use the shorthand.) UTF-8 applies only to ISO 10646.

So UTF-8 covers the whole span, from a set of symbols to their encoding
as octet sequences. I give you a UTF-8 sequence, you know what symbols
are represented.
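
To make that concrete, here is a minimal sketch in Python (3.9 or later
assumed). The sample octets and the helper name describe_utf8 are made up
for illustration and do not come from any MIB. Decoding a UTF-8 octet
string yields ISO 10646 code points directly, with no outside hint about
the character set:

```python
import unicodedata

def describe_utf8(octets: bytes) -> list[tuple[str, str]]:
    """Decode UTF-8 octets and return (code point, character name) pairs."""
    text = octets.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Hypothetical value, as it might be read from an OCTET STRING object.
sample = bytes([0x41, 0xE3, 0x82, 0xA2])
pairs = describe_utf8(sample)
for code_point, name in pairs:
    print(code_point, name)
```

The first octet decodes to a Latin letter and the next three to a single
katakana letter. The octets alone are enough to tell which is which.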

The nice thing about UTF-8 is that "it has the characteristic of
preserving the entire US-ASCII range: US-ASCII characters are encoded in
one octet having the usual US-ASCII value, and any octet with such a
value can only be an US-ASCII character." (RFC 2044) All the additional
characters in ISO 10646 are encoded as multi-octet sequences in which
all octets have the eighth bit set.
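
The property quoted above can be checked by brute force. The following is
only a sketch: it relies on Python's built-in UTF-8 codec, which covers
code points up to U+10FFFF rather than the full range RFC 2044 describes,
and the function name is made up for the example.

```python
def utf8_preserves_ascii() -> bool:
    """Check the US-ASCII transparency property of UTF-8 over Python's range."""
    # Every US-ASCII character encodes to one octet with its usual value.
    for cp in range(0x80):
        if chr(cp).encode("utf-8") != bytes([cp]):
            return False
    # Every other character encodes to two or more octets,
    # each with the eighth bit set.
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not encodable characters
        encoded = chr(cp).encode("utf-8")
        if len(encoded) < 2 or any(octet < 0x80 for octet in encoded):
            return False
    return True

print(utf8_preserves_ascii())
```

One consequence: code that scans a UTF-8 string for an ASCII delimiter
octet will not be fooled by part of a multi-octet character.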

:: David Kellerman Northlake Software 503-228-3383
:: david_kellerman@nls.com Portland, Oregon fax 503-228-5662

------------------------------------------------------------------------
Date: Wed, 23 Jul 1997 17:27:48 -0700 (PDT)
From: Michael Kirkham <mikek@iwl.com>
To: JK Martin <jkm@underscore.com>
CC: pmp@pwg.org
Subject: Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to
allow superset of ASCII

Pardon me for interjecting -- in general I'm only advising Chris, since
I'm not officially part of the working group -- but I think there's some
misunderstanding between various working group members about just what
UTF-8 is and what benefit or drawback there is in switching from ASCII to
UTF-8. Here is my interpretation (and I'm by no means a character set
or localization expert; feel free to correct me if I'm wrong).

After reading RFC 2130, "The Report of the IAB Character Set
Workshop," and skimming through various other documents on UTF-8, it looks
to me on the surface that ASCII and UTF-8 are entirely different beasts,
and it is not simply a matter of switching from one to the other.

The way ASCII and UTF-8 differ is really in what they are, and what they
are trying to accomplish. ASCII is a "Coded Character Set," whereas UTF-8
is a "Character Encoding Scheme." RFC 2130 defines these to be:

3.2.1: Coded Character Set

A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
[ISO-8859].

3.2.2: Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a Coded
Character Set or several coded character sets to a set of octets.
Examples of Character Encoding Schemes are ISO 2022 [ISO-2022] and
UTF-8 [UTF-8]. A given CES is typically associated with a single
CCS; for example, UTF-8 applies only to ISO 10646.

To paraphrase, a Coded Character Set, such as ASCII, is the association of
a particular glyph (a graphical, visual representation such as the letter
A, or an asterisk) with a numerical value. (Example, in the case of
ASCII, the glyph of the letter "A" has a value of 65).
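
A small sketch of the coded-character-set idea, in Python, with characters
picked arbitrarily for illustration. For the single-octet sets, the one
octet produced by encode() is the integer the set assigns to the
character; for ISO 10646, ord() gives the code point directly.

```python
# A coded character set maps abstract characters to integers.
ascii_value = "A".encode("ascii")[0]             # US-ASCII: letter A
latin1_value = "\u00e9".encode("iso-8859-1")[0]  # ISO 8859-1: e with acute accent
ucs_value = ord("\u30a2")                        # ISO 10646: katakana letter A

print(ascii_value, latin1_value, hex(ucs_value))
```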

A Character Encoding Scheme such as UTF-8, on the other hand, is a way
of encoding different Coded Character Sets (such as ASCII, or katakana, or
cyrillic) into a stream of octets, such that you can algorithmically
determine that a set of N octets is a single glyph, the next set of M
octets is another glyph, and so on -- but not to determine just what glyph
that is.

In other words, given a stream of octets that is encoded in a Character
Encoding Scheme such as UTF-8, there is no way to know intrinsically what
glyphs to display. A value encoded in UTF-8 is therefore meaningless
without knowing what Coded Character Set (such as ASCII) it was mapped
from -- unless you know, by virtue of Printer-Y being sold in Japan that
the Coded Character Set is katakana (but the Character Encoding Scheme
used to store that value in the MIB object is still UTF-8).

The association between ASCII and UTF-8 is what RFC 2130 simply refers to
as an "Exact Conversion" by nature of the ASCII Coded Character Set being
identical to ASCII-encoded-in-UTF-8 (RFC 2130 section 3.5 - 3.5.1).

What I'm getting at is that it doesn't seem to me that you can actually
gain anything on the localization issue by using UTF-8. Yes you can now
write katakana into a MIB object -- but you can't get katakana out, unless
something somewhere says that the value you are reading is katakana, or
unless you just *know*. If you don't simply know, it falls into what RFC
2130 calls "Determined by guessing" (section 3.3), and which it
discourages:

We recommend that each protocol clearly specify what it is using for
each of the layers of the transmission model. Users (or clients)
should never have to guess what the parameter is for a given layer.
(RFC 2130 section 3.4.1)

One problem that arises, and has been brought up, is how to handle the
case where Management Station A sets the value of a MIB Object and
Management Station B has to read the value and know what it means. This
is a problem in both (a) relaxing a DisplayString object to Octet String
and allowing any Coded Character Set -and- (b) encoding that Coded
Character Set in UTF-8, for the reasons outlined above.

This problem may not be addressable with the idea of having a single MIB
object specifying what Coded Character Set a group of objects are using --
a management station can still encode Cyrillic in UTF-8 and then set a MIB
object, even though some MIB object says that the MIB is using ISO-10646.

RFC 2130 also goes on to make recommendations about how you determine the
Coded Character Set of a given bit of text, such as MIME headers, etc. A
solution along these lines is beyond the scope of the Printer MIB. What
SNMP really needs in this area is its own version of a MIME header in
both the SetRequest-PDU and the GetResponse-PDU, so that a management
station can issue a Set Request saying, "This is ISO-10646 encoded in
UTF-8," and later another management station can read the value with the
Get Response saying "This UTF-8 value should be decoded to ISO-10646."

This is, I believe, where future versions of SNMP are probably heading.

As I said, if I've misinterpreted the relationship between ASCII/Coded
Character Sets and UTF-8/Character Encoding Schemes, please let me know.
picture, and I fully admit I'm not entirely clear myself. This message
isn't directed at anyone in particular -- only trying to make sure
everyone sees the same picture.

On Wed, 23 Jul 1997, JK Martin wrote:

> I think Dave Kellerman's new proposal may very well be the best
> way to solve this last-minute crisis in a decent way.
>
> As I understand Dave's proposal, instead of having a fixed charset
> of ASCII, we simply move it to UTF-8, and thereby gain a reasonable
> amount of localization with absolutely minimal impact to the MIB.
>
> Also in support of Dave's proposal--and this means a lot to us as
> mgmt app developers--we stand a much better chance of having decent
> interoperability if we constrain the charset to UTF-8.
>
> I hope the PMP group accepts this new proposal for UTF-8, assuming
> of course no one pokes a big fat hole in it. ;-)
>
> ...jay
>
> ----- Begin Included Message -----
>
> >From pmp-owner@pwg.org Wed Jul 23 18:50 EDT 1997
> Date: Wed, 23 Jul 1997 15:46:51 PST
> From: David_Kellerman@nls.com
> To: pmp@pwg.org
> Subject: Re: PMP> URGENT: SYNTHESIS proposal on definition of OCTET STRING to
> allow superset of ASCII
>
> If there really is a broad interest in "fixing" the localization
> problem, I would suggest an alternative to Tom's proposal -- switch from
> ASCII to UTF-8 for OCTET STRING objects where representation of
> multilingual text is appropriate.
>
> Summary of arguments in favor: no new objects, consistent with existing
> conforming implementations (ASCII is subset of UTF-8), doesn't introduce
> the complexity of multiple character sets for affected objects, doesn't
> introduce the complexity of changeable character sets for affected
> objects, seems to be consistent with direction of IETF generally and
> SNMP in particular.
>
> Problems I see are, briefly: forces implementations to deal with UTF-8,
> and it conflicts with existing implementations that allow non-ASCII
> characters in the strings. How serious these are depends, in part, on
> whether you believe other MIB work is going to force UTF-8 anyway, and
> how much weight you want to give to existing practice that deviates from
> the existing standard.
>
> Supporting material:
> 1. See the note from Randy Presuhn that Chris forwarded to the mailing
> list. He suggests this approach, has obviously given the topic a
> lot of thought, and discusses it in some detail. He also asserts
> that the SNMPv3 effort is headed toward use of UTF-8 for all
> human-readable strings.
> 2. I read Harald Alvestrand's message differently than Tom. I think it
> says to specify the character set (a single one) and recommends
> UTF-8; not to allow multiple character sets, chosen at the
> discretion of the agent or application.
> 3. I also read RFC 2130 (The Character Set Workshop Report) differently
> than Tom. It covers a lot of ground, trying to address migration of
> existing protocols as well as new work. For new protocols in
> particular, it says in part:
> New protocols do not suffer from the need to be compatible with
> old 7-bit pipes. New protocol specifications SHOULD use ISO
> 10646 as the base charset unless there is an overriding need to
> use a different base character set.
>
> Here are the details of the changes to the document:
>
> 1. Copy the Utf8String TC from the sysAppl draft:
>
> Utf8String ::= TEXTUAL-CONVENTION
> DISPLAY-HINT "255a"
> STATUS current
> DESCRIPTION
> "To facilitate internationalization, this TC
> represents information taken from the ISO/IEC IS
> 10646-1 character set, encoded as an octet string
> using the UTF-8 character encoding scheme described
> in RFC 2044 [**]. For strings in 7-bit US-ASCII,
> there is no impact since the UTF-8 representation
> is identical to the US-ASCII encoding."
> SYNTAX OCTET STRING (SIZE (0..255))
>
> Stylistically, you might want to introduce a ShortUtf8String with
> SIZE (0..63) -- it would simplify many of the SYNTAX clauses (see
> below).
>
> 2. Change the SYNTAX for the following objects from OCTET STRING:
>
> prtGeneralCurrentOperator Utf8String (SIZE(0..127))
> prtGeneralServicePerson Utf8String (SIZE(0..127))
> prtGeneralSerialNumber Utf8String
> prtGeneralPrinterName Utf8String
>
> prtInputMediaName Utf8String (SIZE(0..63))
> prtInputName Utf8String (SIZE(0..63))
> prtInputVendorName Utf8String (SIZE(0..63))
> prtInputModel Utf8String (SIZE(0..63))
> prtInputVersion Utf8String (SIZE(0..63))
> prtInputSerialNumber Utf8String (SIZE(0..32))
>
> prtInputMediaType Utf8String (SIZE(0..63))
> prtInputMediaColor Utf8String (SIZE(0..63))
>
> prtOutputName Utf8String (SIZE(0..63))
> prtOutputVendorName Utf8String (SIZE(0..63))
> prtOutputModel Utf8String (SIZE(0..63))
> prtOutputVersion Utf8String (SIZE(0..63))
> prtOutputSerialNumber Utf8String (SIZE(0..63))
>
> prtMarkerColorantValue Utf8String
>
> prtChannelProtocolVersion Utf8String (SIZE(0..63))
>
> prtInterpreterLangLevel Utf8String (SIZE(0..31))
> prtInterpreterLangVersion Utf8String (SIZE(0..31))
> prtInterpreterVersion Utf8String (SIZE(0..31))
>
> 3. Add the reference to RFC 2044 to the bibliography:
>
> [**] F. Yergeau, "UTF-8, a transformation format of Unicode
> and ISO 10646", RFC 2044, October 1996.
>
> That's it.
>
> :: David Kellerman Northlake Software 503-228-3383
> :: david_kellerman@nls.com Portland, Oregon fax 503-228-5662
>
>
> ----- End Included Message -----
>
>

---------------------------------------------------------------------------
--==--==--==- Michael Kirkham Software Engineer
==--==--==--= Email: mikek@iwl.com Web: http://www.iwl.com/
--==--==--==- InterWorking Labs, Inc. 244 Santa Cruz Ave, Aptos, CA 95003
==--==--==--= Tel: +1 408 685 3190 Fax: +1 408 662 9065
---------------------------------------------------------------------------
