Subj: 5 alternatives for removing the code set ambiguity in the Job Mon MIB
From: Tom Hastings
Date: 7/25/97
File: jmpalter.doc
The Job Monitoring MIB has the same problem of identifying and/or specifying
coded character sets that we are currently wrestling with for the Printer MIB.
The solutions do NOT necessarily need to be the same.
David Perkins raised this issue when reviewing our Job Mon MIB. His
comment states:
"The module has a problem with all the objects that have type of OCTET
STRING. You really need to enforce a code-point mapping. Consider a
management application. What are they to do with the values?
Do they try to figure out the encoding, or ask the user of the
application for a hint, or what?"
Here are 5 alternatives that I can think of:
1. Leave OCTET STRING undefined as in the current draft and therefore open
to any coded character set, including any of the 211 registered with IANA.
2. Restrict OCTET STRING to US-ASCII in 32 to 126; unused: 0-31, 127-255.
3. Allow ANY coded character set, even ones that don't have US-ASCII in 32
to 126, but have a (read-only) attribute to specify which coded character
set the agent has used to store the data (usually the code that the submitter
used for that job).
4. US-ASCII in 32-126, other specified sets in 128 to 255; unused: 0-31, 127,
but have a read-only attribute to specify which coded character set the
agent has used to store the data (usually the code that the submitter
used for that job). (My Wednesday SYNTHESIS proposal)
5. Only UTF-8 in 32 to 126 and 128 to 255; unused: 0-31, 127
(David Kellerman's proposal)
Now the PROs and CONs of each of the 5 alternatives:
Unlike the Printer MIB, we can consider a "wide open" alternative
which allows any coded character set, not just ones that are restricted
to having US-ASCII in code positions 32 to 126. So straight Unicode
or any national 7-bit set could be used in this alternative (see alternative
3).
We are only dealing with the coded character set (and encoding scheme) of
the objects and attributes that have syntax OCTET STRING. We are NOT
attempting to solve the much harder problem of localization
that includes language and country for these objects and attributes.
Also for the Job Monitoring MIB, unlike the Printer MIB, all objects are
read-only, so the application can NOT write any.
ISSUE 111 (restated): How does an application determine the coded character
set for the 3 objects and attributes that the agent generates (that did not
come from the job submitter)?
The following objects and attributes are in question:
jmGeneralJobSetName object
processingMessage attribute
physicalDevice (name value) attribute
ISSUE 112 (restated): How does an application determine the coded character
set of the 19 objects/attributes that come from the job submitter (or are
defaulted by the server/device)?
The 19 objects and attributes in question are:
IETF Job object/attributes     Equivalent IPP attributes
--------------------------     -------------------------
jmJobOwner object              "job-originating-user"
other,                         -
unknown,                       -
serverAssignedJobName,         -
jobName,                       "job-name"
jobAccountName,                -
submittingServerName,          -
submittingApplicationName,     -
jobOriginatingHost,            "job-originating-host"
deviceNameRequested,           "printer-uri"
queueNameRequested,            -
fileName,                      "document-uri"
documentName,                  "document-name"
jobComment,                    -
outputBin (name),              -
mediumRequested (name),        "media"
mediumConsumed (name),         -
colorantRequested (name),      -
colorantConsumed (name)        -
We have five alternatives proposed for these OCTET STRING objects
and attributes for either of issues 111 and 112:
NOTE: The solutions for issues 111 and 112 do NOT need to be the same.
1. Leave OCTET STRING undefined as in the current draft and therefore open
to any coded character set, including any of the 211 registered with IANA.
The current text is:
"Internationalization Considerations
There are a number of objects in this MIB that are represented as coded
character sets with a data type of OCTET STRING. Most of the objects are
supplied as job attributes by the client that submits the job to the server
or device and so are represented in the coded character set specified by
that client.
For simplicity, this specification assumes that the clients, job monitoring
applications, servers, and devices are all running in the same locale,
including locales that use two-octet coded character sets, such as ISO 10646
(Unicode). Job monitoring applications are expected to understand the coded
character set of the client (and job), server, or device. No special means
is provided for the monitor to discover the coded character set used by jobs
or by the server or device. This specification does not contain an object
that indicates what locale the server or device is running in, let alone
contain an object to control what locale the agent is to use to represent
coded character set objects.
This MIB also contains objects that are represented using the DateAndTime
textual convention from SMIv2 [SMIv2]. The job management application SHALL
display such objects in the locale of the user running the monitoring
application."
PRO:
b. Easiest for us to do.
d. Some job submission protocols don't specify the coded character set,
either in their specification or in the protocol, so the agent has no way
of knowing which coded character set the job submission used.
CON:
a. Possible that our Area Director won't forward the document, since it is
ambiguous about coded character set and so it would not become a draft
standard.
c. The Area Director might fix the problem for us (but we might not like
the results, if we don't participate).
d. The agent can put the value 'unknown(2)' to indicate that the coded
character set is unknown, thus making it clear to the application that the
application has to use other means in order to determine the coded
character set, such as asking the user, assuming the default for the host
platform, or assuming the default for the agent (if we add such an object).
2. Restrict OCTET STRING to US-ASCII in 32 to 126; unused: 0-31, 127-255.
- Change the document to restrict OCTET STRING data to
US-ASCII in 32-126 and state that 128 to 255 SHALL NOT be used.
- Indicate that code positions 0-31 and 127 SHALL NOT be used, unless
the DESCRIPTION clause specifies otherwise.
- Also add a proper reference to the ANSI X3.4:1968 that specifies ASCII.
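As an illustration only, the restriction in alternative 2 could be checked
by an agent or application roughly as follows. This is a minimal sketch in
Python, assuming the permitted range is exactly 32 to 126; the function name
is hypothetical and is not part of the MIB.

```python
def is_printable_us_ascii(octets: bytes) -> bool:
    """Return True if every octet lies in the US-ASCII graphic range 32..126.

    Octets 0-31 (controls), 127 (DEL), and 128-255 are all rejected,
    matching the 'unused' ranges listed for alternative 2.
    """
    return all(32 <= octet <= 126 for octet in octets)


if __name__ == "__main__":
    print(is_printable_us_ascii(b"Quarterly report"))  # plain ASCII job name
    print(is_printable_us_ascii(b"caf\xe9"))           # Latin1 e-acute in 128-255
```

A value that fails this check would simply have no valid representation
under alternative 2, which is the substance of the CONs listed for it.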
PRO:
a. Satisfies David Perkins and the Area Director by clarifying what the
interpretation of these objects shall be with respect to coded character
set and encoding scheme.
CON:
c. Would not meet the market objectives of many implementations of
job submission protocols that already support parts of the world
where US-ASCII is not sufficient to represent job submitter
supplied information.
d. Would NOT even meet the needs of IPP, which specifies UTF-8.
3. Allow ANY coded character set, even ones that don't have US-ASCII in 32
to 126, but have a (read-only) attribute to specify which coded character
set the agent has used to store the data (usually the code that the submitter
used for that job).
- Allow any graphic character set in 0 to 255.
- Don't allow any control characters in 0 to 255.
- Recommend UTF-8 as the default.
- Add a read-only object for specifying the code set for the 3
objects/attributes that do not come from the job submitter.
- Add a (read-only) attribute for specifying the code set for the 19
objects/attributes that come from the job submitter. (The agent can
either keep the data in the original char set of the submitter, or can
code convert to another coded character set, though there is no reason
why an agent would have to convert.)
- Any registered set could be used.
- See: ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.
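To make the application side of alternative 3 concrete, here is a minimal
Python sketch of how a monitoring application might use the proposed
code set attribute. Everything named here is an assumption for illustration:
the function name, the fallback policy, and the small mapping table. The
enum numbers are taken to be the MIBenum values in the IANA registry and
should be checked against it; 'unknown(2)' is the value discussed in this
note.

```python
# Illustrative subset only: IANA charset enum value -> Python codec name.
# (Assumed values; verify against the IANA character-sets registry.)
CHARSET_ENUM_TO_CODEC = {
    3: "ascii",      # US-ASCII
    4: "latin-1",    # ISO 8859-1
    106: "utf-8",    # UTF-8
    2009: "cp850",   # Code page 850
}
UNKNOWN = 2  # agent does not know the coded character set


def decode_job_string(octets: bytes, charset_enum: int,
                      fallback_codec: str = "latin-1") -> str:
    """Decode an OCTET STRING value using the agent-reported code set.

    If the agent reports unknown(2), or a set this application does not
    handle, fall back to a locally chosen default (hypothetical policy).
    Undecodable octets are replaced rather than raising an error.
    """
    codec = CHARSET_ENUM_TO_CODEC.get(charset_enum)
    if charset_enum == UNKNOWN or codec is None:
        codec = fallback_codec
    return octets.decode(codec, errors="replace")


if __name__ == "__main__":
    print(decode_job_string(b"caf\xe9", 4))         # Latin1 octets
    print(decode_job_string(b"caf\xc3\xa9", 106))   # UTF-8 octets
    print(decode_job_string(b"report.ps", UNKNOWN)) # agent could not tell
```

The fallback branch is where an application would apply "other means", such
as asking the user or assuming the host platform default.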
PRO:
a. Would satisfy David Perkins and the Area Directors to specify the coded
character set (in the protocol).
b. Conforms to current practice of job submission protocols, including ones
that use Unicode (which doesn't have US-ASCII in 32 to 126).
c. Includes UTF-8 (recommended default) as recommended by the IAB in
RFC 2130.
d. Makes the agent's life easier, since the agent would not have to do any
code conversion from the data that is submitted with the job. The agent
either plugs in the coded character set or the 'unknown(2)' value.
e. Allows an application to determine the coded character set of the
objects and attributes, if known to the agent.
CON:
a. It would be too wide open. Applications would have to deal with many
more sets (or hope that their host platform does). Most data can be
represented in US-ASCII, so losing even the guarantee that 32 to 126 is
US-ASCII would be a great burden on the application.
4. US-ASCII in 32-126, other specified sets in 128 to 255; unused: 0-31, 127,
but have a read-only attribute to specify which coded character set the
agent has used to store the data (usually the code that the submitter
used for that job). (My Wednesday SYNTHESIS proposal)
- Allow any graphic characters in 128 to 255, but 32-126 SHALL be US-ASCII
- Indicate that code positions 0-31 and 127 SHALL NOT be used.
- Recommend UTF-8 as the default.
- Add a read-only object for specifying the code set for the 3
objects/attributes that do not come from the job submitter.
- Add a (read-only) attribute for specifying the code set for the 19
objects/attributes that come from the job submitter.
- The agent can either keep the data in the original char set of the
submitter, or can code convert to another coded character set.
- The agent SHALL code convert if the job submission code set does not
have US-ASCII in 32-126. For example, the agent SHALL code convert from
Unicode to UTF-8.
- Any of the 50 (out of 211) registered sets that contain US-ASCII in
32 to 126 could be used.
- See: ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.
- List conforming examples, which include UTF-8 (recommended default),
ISO 8859-1 (Latin1), HPRoman8, Code page 850, US-ASCII,
US-ASCII/JIS X0208 (Japanese national two byte set in 128-255),
Shift JIS (Microsoft extension to extend Katakana to represent Kanji),
US-ASCII/GB2312 (PRC Chinese national two-byte set in 128-255).
- Add a proper reference to the ANSI X3.4:1968 that specifies ASCII,
plus all the examples.
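The agent-side rule in alternative 4 (keep the submitter's set if it has
US-ASCII in 32-126, otherwise code convert) can be sketched as follows. This
is a hedged Python illustration, not a definitive implementation: the
function names are hypothetical, Python codec names stand in for registered
sets, and "Unicode" is assumed here to mean two-octet big-endian UCS-2/UTF-16.
The test is only meaningful for codecs that decode single octets
independently.

```python
def has_us_ascii_in_32_to_126(codec: str) -> bool:
    """Return True if octets 32..126 each decode to the same US-ASCII character."""
    for code in range(32, 127):
        try:
            char = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            # e.g. a two-octet set cannot decode a single octet at all
            return False
        if char != chr(code):
            return False
    return True


def to_conforming(octets: bytes, submitted_codec: str):
    """Return (octets, codec) that satisfy alternative 4.

    Data is kept as submitted when the set already has US-ASCII in 32-126;
    otherwise it is converted to UTF-8, the recommended default.
    """
    if has_us_ascii_in_32_to_126(submitted_codec):
        return octets, submitted_codec
    return octets.decode(submitted_codec).encode("utf-8"), "utf-8"


if __name__ == "__main__":
    print(to_conforming(b"caf\xe9", "latin-1"))                  # kept as is
    print(to_conforming("Job".encode("utf-16-be"), "utf-16-be")) # converted
```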
PRO:
a. Conforms to current practice of most job submission protocols.
b. Permits UTF-8 to be used as recommended by the IAB for new protocols
in RFC 2130.
c. Recommends UTF-8 as the default as recommended by the IAB in RFC 2130.
d. Permits other coded character sets, as allowed by the IAB for existing
protocols in RFC 2130.
e. Allows Job Monitoring MIB implementations to use the coded character set
that the customer's environment uses (US-ASCII, Unicode, ISO 8859-n
(n=1..11), Microsoft extensions to LatinN, Code page 850, HP Roman8,
JIS X0208, Shift JIS, GB 2312, ...).
f. Allows the vendor supplied and the system administrator supplied data to
be represented in a SINGLE coded character set established at install time.
(See separate e-mail on how this works in current vendor implementations).
CON:
g. Harder for applications that want to process the values returned from
the MIB (as opposed to simply displaying data, which is usually handled by
the platform), if the data includes values in 128 to 255 and the
application needs to support more than one possible coded character set
that the system administrator could have specified at install time. For
example, if the application is supporting the Western Hemisphere and
Western Europe, the application might need to support ISO Latin1,
HP Roman8, and Code page 850, depending on the customer's environment.
The situation is similar for Asia, where the application might have to
support Unicode/UTF-8, JIS X0208, and GB2312.
e. If the coded character set specified for the MIB is different from
that supported by the host platform in which the application is running,
the application will have to perform code conversion in order to display
the coded character set data to the user.
f. Complicates the system administrator install procedures, since the
information on the install floppy needs to be represented in different
coded character sets.
5. Only UTF-8 in 32 to 126 and 128 to 255; unused: 0-31, 127
(David Kellerman's proposal)
- Allow only UTF-8 which is US-ASCII in 32-126 and a multi-byte character
encoding scheme in 128 to 255 that represent the characters of
ISO 10646 (Unicode).
- Indicate that code positions 0-31 and 127 SHALL NOT be used.
- Also add a proper reference to the ANSI X3.4:1968 that specifies ASCII
and a reference to UTF-8.
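For alternative 5, a conformance check reduces to "well-formed UTF-8 with no
control codes". The sketch below is a minimal Python illustration under two
assumptions: the unused positions are 0-31 and 127, as in the alternative's
heading, and the function name is hypothetical. Because every octet of a
multi-octet UTF-8 sequence lies in 128 to 255, the control-code test can be
applied to the raw octets.

```python
def is_valid_alt5_value(octets: bytes) -> bool:
    """Return True if octets are well-formed UTF-8 without 0-31 or 127."""
    try:
        octets.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return not any(octet < 32 or octet == 127 for octet in octets)


if __name__ == "__main__":
    print(is_valid_alt5_value("caf\u00e9".encode("utf-8")))  # UTF-8 e-acute
    print(is_valid_alt5_value(b"caf\xe9"))                   # bare Latin1 octet
    print(is_valid_alt5_value(b"line1\nline2"))              # control code 10
```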
PRO:
a. It is the recommendation of the IAB in RFC 2130 for "new [Internet]
protocols or new versions of old protocols" to use UTF-8 as the "default".
The Job Monitoring MIB is a *new* protocol, so we should take the
opportunity to follow the IAB recommendation now.
c. Only a single coded character set is permitted, so that applications
only have to deal with a single fixed coded character set at design time,
namely UTF-8.
e. ISO 10646 and UTF-8 are winning support in many quarters for actual
implementation. NT has Unicode as its internal code set. IPP specifies
all text attribute values in UTF-8. Novell Netware 4.2 supports Unicode.
f. Don't need to add an object or an attribute to indicate the coded
character set of either the 3 objects/attributes that don't come from the
job submitter or the 19 that do.
CON:
a. The same paragraph of RFC 2130 (page 3) goes on to say: "These defaults
do not deprecate the use of other character sets when and where they are
needed; they are simply intended to provide guidance and a specification
for interoperability."
In fact, RFC 2130 does not even mention SNMP as one of the Internet
Protocols. I wonder why? Because SNMP is more likely to be deployed on
a LAN, not the Internet?
b. Most current job submission protocols do NOT use UTF-8, so the
agent would have to perform code conversion to UTF-8.
c. Forces applications to deal with UTF-8, when some applications would be
far simpler if they just used the coded character set of the environment.
d. Many applications do NOT actually need to process the information from
the MIB; they merely pass it through to the host platform, which takes care
of displaying the information. Unless the platform supports UTF-8 (or
equivalently Unicode, such as NT or Novell 4.1), the application will have
to convert the coded character set data from UTF-8 to some other coded
character set that the host platform can display to the user.
e. Acceptance of ISO 10646 (Unicode) in Asian markets has not been
enthusiastic. Many customers have huge investments in data and
applications that use their national two-byte sets (JIS X0208 Japanese and
GB2312 Chinese). So the Asian vendors have not all jumped on the ISO 10646
bandwagon. Some have, some have not. I don't have good figures on
the size of each camp. I think it also depends on the application area.
Code conversion between UTF-8 and JIS X0208 or GB2312 requires a huge
14-bit table lookup, i.e., a table of 16K (2^14) entries.
NOTE: both proposals 4 and 5 are upgrades from US-ASCII, so that PRO is not
mentioned with either alternative.