JMP> 5 alternatives for indicating code set (Issues 111 and 112)

Mon Jul 28 03:40:37 EDT 1997

Subj:  5 alternatives to removing the code set ambiguity in the Job Mon MIB
From: Tom Hastings
Date: 7/25/97
File:  jmpalter.doc

The Job Monitoring MIB has the same problem of identifying and/or specifying
coded character sets that we are currently wrestling with for the Printer MIB.
The solutions do NOT necessarily need to be the same.

David Perkins made this comment when reviewing our Job Monitoring MIB.  His
comment states:

  "The module has a problem with all the objects that have type of OCTET
  STRING.  You really need to enforce a code-point mapping. Consider a 
  management application. What are they to do with the values? 
  Do they try to figure out the encoding, or ask the user of the 
  application for a hint, or what?"

Here are 5 alternatives that I can think of:

1. Leave OCTET STRING undefined as in the current draft and therefore open
to any coded character set, including any of the 211 registered with IANA.

2. Restrict OCTET STRING to US-ASCII in 32 to 126;  unused: 0-31, 127-255.

3. Allow ANY coded character set, even ones that don't have US-ASCII in 32
to 126, but have a (read-only) attribute to specify which coded character
set the agent has used to store the data (usually the code that the submitter
used for that job).

4. US-ASCII in 32-126, other specified sets in 128 to 255; unused: 0-31, 127,
but have a read-only attribute to specify which coded character set the
agent has used to store the data (usually the code that the submitter
used for that job).  (My Wednesday SYNTHESIS proposal)

5. Only UTF-8 in 32 to 126 and 128 to 255; unused: 0-31, 127
   (David Kellerman's proposal)

Now the PROs and CONs of each of the 5 alternatives:

Unlike the Printer MIB, we can consider a "wide open" alternative 
which allows any coded character set, not just ones that are restricted
to having US-ASCII in code positions 32 to 126.  So straight Unicode
or any national 7-bit set could be used in this alternative (see alternative
3).

We are only dealing with the coded character set (and encoding scheme) of 
the objects and attributes that have syntax OCTET STRING.  We are NOT 
attempting to solve the much harder problem of localization 
that includes language and country for these objects and attributes.

Also, for the Job Monitoring MIB, unlike the Printer MIB, all objects are 
read-only, so the application cannot write any of them.

ISSUE 111 (restated):  How does an application determine the coded character 
set for the 3 objects and attributes that the agent generates (that did not 
come from the job submitter)?  

The following objects and attributes are in question:

  jmGeneralJobSetName object
  processingMessage attribute
  physicalDevice (name value) attribute

ISSUE 112:  (restated) How does an application tell the coded character set of 
the 19 objects/attributes that come from the job submitter (or are defaulted
by the server/device)?

The 19 objects and attributes in question are:

  IETF Job object/attributes   Equivalent IPP attributes
  --------------------------   -------------------------
  jmJobOwner object            "job-originating-user"
  other,                       -
  unknown,                     -
  serverAssignedJobName,       -
  jobName,                     "job-name"
  jobAccountName,              -
  submittingServerName,        -
  submittingApplicationName,   -
  jobOriginatingHost,          "job-originating-host"
  deviceNameRequested,         "printer-uri"
  queueNameRequested,          -
  fileName,                    "document-uri"
  documentName,                "document-name"
  jobComment,                  -
  outputBin (name),            -
  mediumRequested (name),      "media"
  mediumConsumed (name),       -
  colorantRequested (name),    -
  colorantConsumed (name)      -

We have five alternatives proposed for these OCTET STRING objects
and attributes for either of issues 111 and 112:

NOTE: The solutions for issues 111 and 112 do NOT need to be the same.

1. Leave OCTET STRING undefined as in the current draft and therefore open
to any coded character set, including any of the 211 registered with IANA.

The current text is:

  "Internationalization Considerations
  There are a number of objects in this MIB that are represented as coded 
  character sets with a data type of OCTET STRING.  Most of the objects are 
  supplied as job attributes by the client that submits the job to the server
  or device and so are represented in the coded character set specified by 
  that client.

  For simplicity, this specification assumes that the clients, job monitoring 
  applications, servers, and devices are all running in the same locale, 
  including locales that use two-octet coded character sets, such as ISO 10646 
  (Unicode).  Job monitoring applications are expected to understand the coded 
  character set of the client (and job), server, or device.  No special means 
  is provided for the monitor to discover the coded character set used by jobs 
  or by the server or device.  This specification does not contain an object 
  that indicates what locale the server or device is running in, let alone 
  contain an object to control what locale the agent is to use to represent 
  coded character set objects.

  This MIB also contains objects that are represented using the DateAndTime 
  textual convention from SMIv2 [SMIv2].  The job management application SHALL 
  display such objects in the locale of the user running the monitoring 
  application."

   PRO:  

   b. Easiest for us to do.

   d. Some job submission protocols don't specify the coded character set 
   either in their specification or in the protocol, so the agent has no way
   of knowing which coded character set the job submission used.

   CON:  

   a. Possible that our Area Director won't forward the document, since it is
   ambiguous about coded character set and so it would not become a draft 
   standard.

   c. The Area Director might fix the problem for us (but we might not like
   the results, if we don't participate).

   d. The agent can put the value 'unknown(2)' to indicate that the coded 
   character set is unknown, thus making it clear to the application that the 
   application has to use other means in order to determine the coded 
   character set, such as asking the user, assuming the default for the host 
   platform, or assuming the default for the agent (if we add such an object).

2. Restrict OCTET STRING to US-ASCII in 32 to 126;  unused: 0-31, 127-255.

   - Change the document to restrict OCTET STRING data to US-ASCII in
     32-126 and state that 128 to 255 SHALL NOT be used.  
   - Indicate that code positions 0-31 and 127 SHALL NOT be used, unless
     the DESCRIPTION clause specifies otherwise.  
   - Also add a proper reference to ANSI X3.4:1968, which specifies ASCII.
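
   As a rough illustration of what this restriction means for an
   application, the check below accepts only octets in 32 to 126.  The
   function name is hypothetical, and the sketch ignores the "unless the
   DESCRIPTION clause specifies otherwise" exception.

```python
def is_restricted_ascii(value: bytes) -> bool:
    """Return True if every octet lies in code positions 32..126.

    Hypothetical helper: it models alternative 2 only, where control
    codes (0-31, 127) and the upper half (128-255) are not used.
    """
    return all(32 <= octet <= 126 for octet in value)


# A job name supplied in Latin-1 with an accented letter would be rejected.
print(is_restricted_ascii(b"Quarterly report.ps"))
print(is_restricted_ascii("Gesch\u00e4ftsbericht".encode("latin-1")))
```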

   PRO:  

   a. Satisfies David Perkins and the Area Director by clarifying what the 
   interpretation of these objects shall be with respect to coded character 
   set and encoding scheme.

   CON:  

   c. Would not meet the market objectives of many of the implementations of 
   job submission protocols that already support other parts of the world 
   where US-ASCII is not sufficient to represent job submitter
   supplied information.

   d. Would NOT even meet the needs of IPP, which specifies UTF-8.

3. Allow ANY coded character set, even ones that don't have US-ASCII in 32
to 126, but have a (read-only) attribute to specify which coded character
set the agent has used to store the data (usually the code that the submitter
used for that job).

   - Allow any graphic character set in 0 to 255.
   - Don't allow any control characters in 0 to 255.
   - Recommend UTF-8 as the default.
   - Add a read-only object for specifying the code set for the 3 
     objects/attributes that do not come from the job submitter.
   - Add a (read-only) attribute for specifying the code set for the 19 
     objects/attributes that come from the job submitter.  (The agent can 
     either keep the data in the original char set of the submitter, or can 
     code convert to another coded character set, though there is no reason
     why an agent would have to convert.)
   - Any registered set could be used.
   - See:  ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.
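
   A minimal sketch of how a monitoring application might use such an
   attribute follows.  It assumes the application has already mapped the
   attribute value to a charset name that Python's codecs module knows,
   and that the string "unknown" stands in for the 'unknown(2)' value;
   the function name and the UTF-8 fallback are illustrative choices,
   not part of any draft.

```python
import codecs

UNKNOWN = "unknown"  # assumption: stands in for the 'unknown(2)' value


def decode_job_string(value: bytes, charset: str, fallback: str = "utf-8") -> str:
    """Decode an OCTET STRING using the agent-reported coded character set.

    Falls back to `fallback` when the agent reports 'unknown' or when the
    reported name is not a codec this platform supports.
    """
    name = fallback if charset == UNKNOWN else charset
    try:
        codecs.lookup(name)
    except LookupError:
        name = fallback
    # Undecodable octets become U+FFFD rather than raising an error.
    return value.decode(name, errors="replace")


# Straight Unicode (no US-ASCII in 32 to 126) is usable once the set is known.
raw = "\u65e5\u672c\u8a9e".encode("utf-16-be")
print(decode_job_string(raw, "utf-16-be"))
```

   In a real application the fallback would more likely come from asking
   the user or from the host platform default.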

   PRO:  

   a. Would satisfy David Perkins and the Area Directors to specify the coded 
   character set (in the protocol).

   b. Conforms to current practice of job submission protocols, including ones
   that use Unicode (which doesn't have US-ASCII in 32 to 126).

   c. Includes UTF-8 (recommended default) as recommended by the IAB in 
   RFC 2130.

   d. Makes the agent's life easier, since the agent would not have to do any 
   code conversion from the data that is submitted with the job.  The agent
   either plugs in the coded character set or the 'unknown(2)' value.

   e. Allows an application to determine the coded character set of the
   objects and attributes, if known to the agent.

   CON:

   a. It would be too wide open.  Applications would have to deal with many
   more sets (or hope that their host platform does).  Most data can be 
   represented in US-ASCII, so losing the guarantee that at least 32 to 126
   is US-ASCII would be a great burden on the application.

4. US-ASCII in 32-126, other specified sets in 128 to 255; unused: 0-31, 127,
but have a read-only attribute to specify which coded character set the
agent has used to store the data (usually the code that the submitter
used for that job).  (My Wednesday SYNTHESIS proposal)

   - Allow any graphic characters in 128 to 255, but 32-126 SHALL be US-ASCII
   - Indicate that code positions 0-31 and 127 SHALL NOT be used.
   - Recommend UTF-8 as the default.
   - Add a read-only object for specifying the code set for the 3 
     objects/attributes that do not come from the job submitter.
   - Add a (read-only) attribute for specifying the code set for the 19 
     objects/attributes that come from the job submitter.  
   - The agent can either keep the data in the original char set of the 
     submitter, or can code convert to another coded character set.  
   - The agent SHALL code convert if the job submission code set does not
     have US-ASCII in 32-126.  For example, the agent SHALL code convert from 
     Unicode to UTF-8.
   - Any of the 50 (out of 211) registered sets that contain US-ASCII in
     32 to 126 could be used.
   - See:  ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.
   - List conforming examples, which include UTF-8 (recommended default), 
     ISO 8859-1 (Latin1), HPRoman8, Code page 850, US-ASCII, 
     US-ASCII/JIS X0208 (Japanese national two byte set in 128-255), 
     Shift JIS (Microsoft extension to extend Katakana to represent Kanji), 
     US-ASCII/GB2312 (PRC Chinese national two-byte set in 128-255).
   - Add a proper reference to the ANSI X3.4:1968 that specifies ASCII,
     plus all the examples.
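
   The agent-side rule above (keep the submitter's set if it has US-ASCII
   in 32 to 126, otherwise code convert to UTF-8) could be sketched as
   follows.  The function names are hypothetical, charset names are
   assumed to be ones Python's codecs module accepts, and the probe is a
   simple heuristic rather than a lookup in the IANA registry.

```python
from typing import Tuple

ASCII_GRAPHIC = bytes(range(32, 127))  # code positions 32..126


def has_ascii_in_32_to_126(codec_name: str) -> bool:
    """Probe whether octets 32..126 decode to the same US-ASCII characters."""
    try:
        return ASCII_GRAPHIC.decode(codec_name) == ASCII_GRAPHIC.decode("ascii")
    except UnicodeDecodeError:
        return False


def store_for_mib(value: bytes, submitted_codec: str) -> Tuple[bytes, str]:
    """Return (octets, code set) as the agent would expose them.

    Data is kept unchanged when the submitter's set has US-ASCII in
    32..126; otherwise it is code converted to UTF-8.
    """
    if has_ascii_in_32_to_126(submitted_codec):
        return value, submitted_codec
    return value.decode(submitted_codec).encode("utf-8"), "utf-8"


# Straight Unicode (here UTF-16, big-endian) must be converted.
print(store_for_mib("Gr\u00fc\u00dfe".encode("utf-16-be"), "utf-16-be"))
```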

PRO:
   a. Conforms to current practice of most job submission protocols.

   b. Permits UTF-8 to be used as recommended by the IAB for new protocols 
   in RFC 2130.

   c. Recommends UTF-8 as the default as recommended by the IAB in RFC 2130.

   d. Permits other coded character sets, as allowed by the IAB for existing
   protocols in RFC 2130.

   e. Allows Job Monitoring MIB implementations to use the coded character set 
   that the customer's environment uses (US-ASCII, Unicode, ISO 8859-n 
   (n=1..11), Microsoft extensions to LatinN, Code page 850, HP Roman8, 
   JIS X0208, Shift JIS, GB 2312, ...).

   f. Allows the vendor supplied and the system administrator supplied data to
   be represented in a SINGLE coded character set established at install time.
   (See separate e-mail on how this works in current vendor implementations).

CON:
   g. Harder for applications that want to process the values returned from 
   the MIB (as opposed to simply displaying data, which is usually handled by
   the platform), if the data includes values in 128 to 255 and the 
   application needs to support more than one possible coded character set
   that the system administrator could have specified at install time.  For 
   example, if the application is supporting the Western Hemisphere and 
   Western Europe, the application might need to support ISO Latin1, 
   HP Roman8 and Code page 850, depending on the customer's environment.
   The situation is similar for Asia, where the application might have to
   support Unicode/UTF-8, JIS X0208, and GB2312.

   e. If the coded character set specified for the MIB is different from
   that supported by the host platform in which the application is running, 
   the application will have to perform code conversion in order to display
   the coded character set data to the user.

   f. Complicates the system administrator install procedures, since the
   information on the install floppy needs to be represented in different
   coded character sets.

5. Only UTF-8 in 32 to 126 and 128 to 255; unused: 0-31, 127
   (David Kellerman's proposal)

   - Allow only UTF-8, which is US-ASCII in 32-126 and a multi-byte character
     encoding scheme in 128 to 255 that represents the characters of 
     ISO 10646 (Unicode).
   - Indicate that code positions 0-31 and 127 SHALL NOT be used.  
   - Also add a proper reference to the ANSI X3.4:1968 that specifies ASCII
     and a reference to UTF-8.
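
   Under this alternative a conformance check is simple enough to sketch.
   The function name is hypothetical; the excluded range follows the
   "unused: 0-31, 127" wording in the heading of this alternative.

```python
def is_valid_mib_utf8(value: bytes) -> bool:
    """True if value is well-formed UTF-8 with no octet in 0..31 or 127.

    In UTF-8, octets below 128 only ever encode the corresponding
    US-ASCII character, so the control codes can be checked per octet.
    """
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any(octet < 32 or octet == 127 for octet in value)


print(is_valid_mib_utf8("M\u00fcller \u5831\u544a".encode("utf-8")))
print(is_valid_mib_utf8("M\u00fcller".encode("latin-1")))
```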

   PRO:
   a. It is the recommendation of the IAB in RFC 2130 for "new [Internet] 
   protocols or new versions of old protocols" to use UTF-8 as the "default".
   The Job Monitoring MIB is a *new* protocol, so we should take the
   opportunity to follow the IAB recommendation now.

   c. Only a single coded character set is permitted, so that applications
   only have to deal with a single fixed coded character set at design time,
   namely UTF-8.

   e. ISO 10646 and UTF-8 are winning support in many quarters for actual 
   implementation.  NT has Unicode as its internal code set.  IPP specifies 
   all text attribute values in UTF-8.  Novell Netware 4.2 supports Unicode.

   f. Don't need to add an object or an attribute to indicate the coded
   character set of either the 3 objects/attributes that don't come from the
   job submitter or the 19 that do.

   CON:

   a. The same paragraph of RFC 2130 (page 3) goes on to say: "These defaults
   do not deprecate the use of other character sets when and where they are
   needed; they are simply intended to provide guidance and a specification
   for interoperability."
   In fact, RFC 2130 does not even mention SNMP as one of the Internet
   Protocols.  I wonder why?  Because SNMP is more likely to be deployed on
   a LAN, not the Internet?

   b. Most current job submission protocols do NOT use UTF-8, so the
   agent would have to perform code conversion to UTF-8.

   c. Forces applications to deal with UTF-8, when some applications would be 
   far simpler if they just used the coded character set of the environment.

   d. Many applications do NOT actually need to process the information from 
   the MIB; they merely pass it through to the host platform, which takes care 
   of displaying the information.  Unless the platform supports UTF-8 (or 
   equivalently Unicode, such as NT or Novell 4.1), the application will have
   to convert the coded character set data from UTF-8 to some other coded 
   character set that the host platform can display to the user.

   e. Acceptance of ISO 10646 (Unicode) in Asian markets has not been 
   enthusiastic.  Many customers have huge investments in data and 
   applications that use their national two byte sets (JIS X0208 Japanese and 
   GB2312 Chinese).  So the Asian vendors have not jumped on the ISO 10646
   bandwagon.  Some have, some have not.  I don't have good figures on
   the size of each camp.  I think it also depends on the application area.
   Code conversion between UTF-8 and JIS X0208 and GB2312 is a huge 14-bit
   table lookup, i.e., a table of 16K (2^14) entries.
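
   To make CON d concrete, here is a minimal sketch of the conversion a
   pass-through application would need when the host platform cannot
   display UTF-8.  The function name is hypothetical, and platform
   charset names are assumed to be ones Python's codecs module accepts
   (for example "euc_jp", an encoding that carries JIS X0208).

```python
def convert_for_display(value: bytes, platform_codec: str) -> bytes:
    """Convert UTF-8 MIB data to the coded character set of the platform.

    Characters the platform set cannot represent are replaced with '?'
    instead of raising an error.
    """
    text = value.decode("utf-8")
    return text.encode(platform_codec, errors="replace")


# A Japanese document name survives on a JIS-based platform but not on
# a Latin-1 platform.
name = "\u5831\u544a".encode("utf-8")
print(convert_for_display(name, "euc_jp"))
print(convert_for_display(name, "latin-1"))
```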

NOTE: both proposals 4 and 5 are upgrades from US-ASCII, so that PRO is not
mentioned with either alternative.