PMP Mail Archive: Re: PMP> what happens if we use UTF-8 in place of ASCII

Ira Mcdonald x10962 (imcdonal@eso.mc.xerox.com)
Fri, 25 Jul 1997 14:04:33 PDT

Hi Harry,

There are no 'levels' of UTF-8. It is an encoding of the FULL
UCS-2 Unicode set of 2-byte code points that stays compatible
with US-ASCII: any octet in the 'left half' of the octet range
(0x00-0x7F) is an ASCII character. For Western European
languages, it averages only slightly more than 1 byte per
character (accented letters take 2 bytes). For other scripts
in UCS-2, it takes 2 or 3 bytes per character.
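
The octet counts are easy to check directly. The following is a
minimal Python sketch, not anything from the PMP drafts; the sample
characters and the helper name utf8_length are chosen here purely
for illustration.

```python
# Sample characters (illustrative assumptions, not from the MIB):
samples = {
    "ASCII letter": "A",                 # U+0041
    "accented Latin letter": "\u00e9",   # e with acute accent
    "Japanese kanji": "\u65e5",          # a common kanji
}

def utf8_length(ch: str) -> int:
    """Return how many octets UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} octet(s)")

# Average octets per character for a short, mostly-ASCII French phrase.
text = "Imprimante pr\u00eate"
average = len(text.encode("utf-8")) / len(text)
print(f"average octets per character: {average:.4f}")
```

For this phrase only one of the sixteen characters needs a second
octet, so the average stays close to 1.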

As Tom already mentioned, to map from the UTF-8 set to some
of the other Asian national sets is a 14-bit table lookup
(BIG!!), so saying the client will just cope (if the whole
rest of the local network environment is running JIS X0208)
is very naive - code conversions are not cheap, for either
the managed device or the management stations (i.e., clients).
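
To make the conversion step concrete, here is a rough Python sketch.
It assumes the local environment uses EUC-JP (one common encoding of
JIS X0208) and leans on Python's built-in codec tables; the function
names are hypothetical. A device would have to carry equivalent
mapping tables itself, which is where the cost comes from.

```python
def utf8_to_euc_jp(octets: bytes) -> bytes:
    """Decode UTF-8 octets, then re-encode them as EUC-JP.

    Raises UnicodeEncodeError if a character has no EUC-JP mapping.
    """
    return octets.decode("utf-8").encode("euc_jp")

def euc_jp_to_utf8(octets: bytes) -> bytes:
    """Reverse direction: EUC-JP octets to UTF-8 octets."""
    return octets.decode("euc_jp").encode("utf-8")

# Two kanji: three octets each in UTF-8, two octets each in EUC-JP.
kanji_utf8 = "\u65e5\u672c".encode("utf-8")
kanji_euc = utf8_to_euc_jp(kanji_utf8)
print(len(kanji_utf8), len(kanji_euc))

# A Thai letter is valid UTF-8 but has no place in the Japanese set,
# so the conversion cannot succeed.
thai_utf8 = "\u0e01".encode("utf-8")
try:
    utf8_to_euc_jp(thai_utf8)
    thai_mappable = True
except UnicodeEncodeError:
    thai_mappable = False
```

The second case is the one a client cannot simply "cope" with: it
has to decide what to display when no mapping exists.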

Cheers,
- Ira McDonald (outside consultant at Xerox)
High North Inc
PO Box 221
Grand Marais, MI 49839
906-494-2434

--------------------------- Harry's note -------------------------
From: Harry Lewis <harryl@us.ibm.com>
To: <pmp@pwg.org>
Subject: PMP> what happens if we use UTF-8 in place of ASCII
Message-Id: <5030100005326866000002L062*@MHS>
Date: Fri, 25 Jul 1997 11:36:34 PDT

David, thanks for changing the name and focus of this topic. I have a
question... does using UTF-8 put any greater burden on the agent in terms
of size of character set and storage required? I've been trying to catch up
with all those URGENT mail messages. I've pulled RFC 2044 and I think I
know what UTF-8 is and why it exists... but I'm not sure what the consensus
is regarding how much of Unicode or ISO 10646 is required behind the UTF-8.

Harry Lewis - IBM Printing Systems

-------- Forwarded by Harry Lewis/Boulder/IBM on 07/25/97 12:27 PM -------

pmp-owner@pwg.org
07/24/97 06:50 PM
Please respond to pmp-owner@pwg.org @ internet

To: pmp@pwg.org @ internet
cc:
Subject: PMP> what happens if we use UTF-8 in place of ASCII

New thread; got tired of looking at that "URGENT" line. Also, I figured
that if this note had a new title, your curiosity would get the better
of you, and you might read the note instead of automatically filing it.

Thought it would be useful to summarize the implications of switching
the ASCII OCTET STRINGs to Utf8String syntax.

Agents:
1. Need to make sure any read-only strings that aren't ASCII use UTF-8
encoding. I'd guess that 99% are ASCII.
2. Remove any 7-bit checking code. Sounds like most agents already
accept 8-bit codes without checking (sounds like Osicom/DPI actually
checks -- someone knew what ASCII meant).
So most agents work unchanged, the rest with slight change. It's worth
noting that, within the SNMP domain, the agent should be able to treat
the affected strings simply as octets. It doesn't do anything, such as
displaying the strings, that would require it to "know" UTF-8.
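
As a rough illustration of the agent-side change, the Python sketch
below models a single string object. The class and its check_ascii
flag are hypothetical; a real agent would do this inside its SET
handler. With the 7-bit check removed, the agent stores and returns
the octets untouched.

```python
def is_seven_bit(octets: bytes) -> bool:
    """True if every octet has its high bit clear (plain ASCII)."""
    return all(b < 0x80 for b in octets)

class StringObject:
    """Hypothetical stand-in for one OCTET STRING object in an agent."""

    def __init__(self, check_ascii: bool):
        self.check_ascii = check_ascii
        self.value = b""

    def set(self, octets: bytes) -> None:
        # An ASCII-checking agent rejects any octet with the high bit set.
        if self.check_ascii and not is_seven_bit(octets):
            raise ValueError("non-ASCII octet rejected")
        # Otherwise the value is opaque: stored exactly as received.
        self.value = bytes(octets)

    def get(self) -> bytes:
        return self.value

name = "caf\u00e9".encode("utf-8")   # contains octets above 0x7F

strict = StringObject(check_ascii=True)
try:
    strict.set(name)
    strict_accepted = True
except ValueError:
    strict_accepted = False

relaxed = StringObject(check_ascii=False)
relaxed.set(name)
```

Note that the relaxed agent never decodes anything, which is why it
does not need to "know" UTF-8.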

Agent environment:
1. If, outside the SNMP domain, the agent device displays the affected
strings, there may be a need for character set conversion.

Application environment:
1. In environments which do not use UTF-8 as the native encoding,
convert codes to/from UTF-8.
It's worth noting that existing applications that operate in an ASCII
environment aren't affected.
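
The application-side conversion can be sketched in a few lines of
Python. The helper names are hypothetical, and ISO 8859-1 (Latin-1)
is only an assumed example of a non-UTF-8 native encoding.

```python
def to_utf8(native: bytes, native_encoding: str) -> bytes:
    """Convert octets in the application's native encoding to UTF-8."""
    return native.decode(native_encoding).encode("utf-8")

def from_utf8(octets: bytes, native_encoding: str) -> bytes:
    """Convert UTF-8 octets read from the agent to the native encoding."""
    return octets.decode("utf-8").encode(native_encoding)

# An ASCII application is unaffected: the octets are already UTF-8.
ascii_name = b"Printer-1"
ascii_as_utf8 = to_utf8(ascii_name, "ascii")

# A Latin-1 application has to convert in both directions.
latin1_name = b"Bj\xf6rn"                      # o-umlaut is one octet
utf8_name = to_utf8(latin1_name, "latin-1")    # o-umlaut becomes two octets
round_trip = from_utf8(utf8_name, "latin-1")
```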

As you think about "what does this mean for my management application,"
here's another thing to consider. Existing applications that use
non-UTF-8 codes will also "work" as long as they don't run up against
another application that expects UTF-8 (or another encoding) -- as noted
above, the agent doesn't know whether the code is really UTF-8 or not. And
we ought to keep in mind that the installed base of applications
consists mostly of one vendor's management application talking to the
same vendor's printers.

In practice, this is no worse than the situation today with applications
that have used non-ASCII character encodings in these strings. Two
applications that have opted for different codes don't interoperate. But
a non-standard application operating by itself can get away with it.
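
A small Python sketch of that failure mode, with two hypothetical
applications sharing one agent string. Latin-1 is only an assumed
example of a non-standard choice, and read_as is an illustrative
helper.

```python
from typing import Optional

def read_as(octets: bytes, encoding: str) -> Optional[str]:
    """Decode octets as the given encoding; return None if they are invalid."""
    try:
        return octets.decode(encoding)
    except UnicodeDecodeError:
        return None

label = "B\u00fcro 3"   # contains u-umlaut

# Application A writes Latin-1; the agent stores the octets unchanged.
written_by_a = label.encode("latin-1")
# A reading its own data works; B, expecting UTF-8, cannot decode it.
a_reads_a = read_as(written_by_a, "latin-1")
b_reads_a = read_as(written_by_a, "utf-8")

# Application B writes UTF-8; A decodes it without error but gets garbage.
written_by_b = label.encode("utf-8")
a_reads_b = read_as(written_by_b, "latin-1")
```

Each application is fine on its own; the trouble only shows when one
reads what the other wrote.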

Curiously, it also is no worse than a multiple-character set solution
where two applications have chosen different character sets. With the
multiple-character set solution, we would be in effect "standardizing"
this lack of interoperability.

:: David Kellerman Northlake Software 503-228-3383
:: david_kellerman@nls.com Portland, Oregon fax 503-228-5662