PMP Mail Archive: Re: PMP> Revised proposal on definition of OCTET STRING to allow superset of ASCII

Re: PMP> Revised proposal on definition of OCTET STRING to allow superset of ASCII

Ira Mcdonald x10962 (imcdonal@eso.mc.xerox.com)
Tue, 22 Jul 1997 15:27:10 PDT

Hi David,

If we took your 'suggestion' below and allowed any UTF-8 sequence in
all current OCTET STRING objects that are supposed to be ASCII, then
we would certainly break both HP and Lexmark implementations (and,
I have reason to believe, some unshipped Xerox products under
development) and probably several others. Where's the gain?
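
To illustrate the kind of breakage meant here, a minimal sketch in Python. The function name and the sample strings are hypothetical, and this is only an assumption about how an ASCII-only implementation might validate a string, not code from any vendor's product:

```python
def accepts_as_ascii(value: bytes) -> bool:
    """Hypothetical check: True only if every octet is 7-bit ASCII."""
    return all(octet < 0x80 for octet in value)


# A pure-ASCII name encodes to the same octets in ASCII and in UTF-8.
plain = "Tray 1".encode("utf-8")

# A name with an accented letter needs octets above 0x7F in UTF-8
# (U+00E0 becomes the two octets C3 A0).
accented = "Bac \u00e0 papier".encode("utf-8")

print(accepts_as_ascii(plain))     # passes the ASCII-only check
print(accepts_as_ascii(accented))  # fails the ASCII-only check
```

Any code that assumes one octet per character, or that rejects octets above 0x7F, would mishandle the second string.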

RFC 2130 does NOT say just use 'UTF-8'. It says that all new
application protocols shall use 'UTF-8' as their default character
set and that they shall convey both 'Charset' (per the IANA
registry, based on ISO, ANSI, and various other national standards)
and 'Language' (per RFC 1766, which builds a combined language-country
tag up from the same two ISO standards currently used in the
Printer MIB, but permits more suffixes for 'dialects').
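
A small sketch of how such a tag is put together, assuming the two ISO standards are ISO 639 (language codes) and ISO 3166 (country codes). The function name is hypothetical, and the lower/upper casing is only the usual convention, since the tags themselves are case-insensitive:

```python
def make_language_tag(language: str, country: str = "", *suffixes: str) -> str:
    """Join a language code, an optional country code, and any further
    subtags with hyphens, in the style of an RFC 1766 language tag."""
    parts = [language.lower()]
    if country:
        parts.append(country.upper())
    parts.extend(suffixes)
    return "-".join(parts)


print(make_language_tag("en", "us"))
print(make_language_tag("fr"))
```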

RFC 2130 does NOT deprecate the fact that HTTP uses ISO 8859-1
(Latin-1) as its default character set. It simply says all
new protocols (and there is NO exception made for MIB interfaces
here) shall be explicit about their default character set
and language for text strings and MAY support alternate
character set and language choices.

Cheers,
- Ira McDonald (outside consultant at Xerox)
--------------------------- Dave's note -------------------------
Date: Tue, 22 Jul 1997 16:06:57 PDT
From: David_Kellerman@nls.com
To: pmp@pwg.org
Message-Id: <009B7A39.5C62A1D2.1@nls.com>
Subject: Re: PMP> Revised proposal on definition of OCTET STRING to allow
superset of ASCII
Sender: pmp-owner@pwg.org
Status: R

> Here is my third message to help everyone make an informed
> decision.
>
> Since our Area Directors have a lot of experience
> with character sets and applications, we asked them if they had
> any suggestions and advice. Harald A. responded, and he did
> strongly encourage us to define the character set. (See below.)
>
> When we reviewed these suggestions and advice against the
> original localization proposal (late June), there did not seem
> to be any way to make it all work. Now it seems possible that
> Tom Hastings' latest proposal would be a satisfactory compromise
> that incorporates this advice and has minimal impact to the MIB.
>
> So we have to ask everyone to think through the technical issues
> and see if this is the case.

This is going to be a bit repetitive (a no-no in e-mail, I know), but
this issue seems to create a lot of confusion.

To my mind, what Tom is proposing is very different from:
1. Randy Presun's e-mail and the SYSAPPL MIB approach to character sets
2. Harald Alvestrand's e-mail
3. The RFC 2130 (Character Set Workshop Report) recommendations
The way I read all these sources, they essentially say to use ISO 10646
(roughly UNICODE worked over by ISO, for those of you still getting your
bearings) as the base character set and UTF-8 as the character encoding
scheme (again, roughly speaking, it encodes ISO 10646 codes as multi-byte
sequences, and its seven-bit single-byte codes happen to match ASCII).
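
A rough illustration of that last point, using Python's built-in UTF-8 codec. The sample characters are arbitrary choices, not taken from the MIB:

```python
# Three ISO 10646 code points: a plain ASCII letter, a Latin-1 letter,
# and a character outside Latin-1 (written as escapes to stay ASCII-safe).
samples = ["A", "\u00e9", "\u20ac"]
encoded = {ch: ch.encode("utf-8") for ch in samples}

for ch, raw in encoded.items():
    print(f"U+{ord(ch):04X} -> {raw.hex()} ({len(raw)} octet(s))")

# Any pure-ASCII octet string decodes identically as ASCII and as UTF-8.
ascii_octets = b"Printer ready"
same = ascii_octets.decode("ascii") == ascii_octets.decode("utf-8")
print(same)
```

Code points below U+0080 take one octet that is identical to ASCII, while everything else takes two or more octets, all with the high bit set.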

Tom's approach, and similarly the approach taken with the
prtLocalizationCharacterSet MIB object, allows multiple character sets
and encodings. You need to know the encoding to interpret the codes;
one code represents different characters in different encodings. In
Tom's proposal, the determination of encoding takes place outside the
MIB.
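
To make the ambiguity concrete, a short sketch that decodes the same single octet under a few encodings. The octet and the list of encodings are arbitrary examples chosen for illustration:

```python
octet = b"\xe9"

# The same octet, read under three different single-byte encodings.
interpretations = {
    name: octet.decode(name) for name in ("latin-1", "cp437", "koi8-r")
}
for name, ch in interpretations.items():
    print(f"{name}: U+{ord(ch):04X}")

# Under UTF-8 the lone octet is not a complete sequence at all.
try:
    octet.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

Without an out-of-band statement of the encoding, a management application has no reliable way to pick among these readings.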

Now these two approaches are not the same, by a long shot. And it's my
understanding that in other places, proponents of opposing sides line up
with armor and broadsword to debate the issue. Being an applications
software person, I happen to prefer the UTF-8 approach. Now I'm not a
licensed character set professional, I've misplaced my broadsword, and
my armor doesn't fit anymore, so I'm feeling a little handicapped in the
debate.

So, Chris, I know you'd like to find a "satisfactory compromise" here,
but I don't see where you've got convergence of positions. (Between
your advisors and Tom's proposal, in particular.) Perhaps Tom
would like to propose that all the strings now constrained as ASCII be
allowed to contain UTF-8 codes?

:: David Kellerman Northlake Software 503-228-3383
:: david_kellerman@nls.com Portland, Oregon fax 503-228-5662