Character Repertories Mail Archive: CR> RE: Value matching i

CR> RE: Value matching in CR

From: McDonald, Ira (imcdonald@sharplabs.com)
Date: Sat Mar 22 2003 - 18:50:10 EST

    Hi Elliott,

    All existing UNIX implementations of POSIX locales do locale name
    matching (language/charset concatenations) based on the rules at
    (1) below. But POSIX itself does not formalize this matching
    rule (anywhere I've been able to find so far).

    (1) Only for purposes of comparing two character repertoire names,
        Printers (or Clients) MUST:
        (a) convert all letters to lowercase;
        (b) remove all hyphens, underscores, and periods; and
        (c) truncate at the colon (the separator before the year of the
            standard's version), dropping it and any trailing date info.
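
    The three steps in (1) can be sketched in Python as follows. This
    is only an illustration, not part of any PWG or POSIX text; the
    function names normalize_for_comparison and names_match are
    hypothetical, and the colon is assumed to be the only year
    separator (as in "ISO_8859-1:1987").

```python
import re


def normalize_for_comparison(name: str) -> str:
    """Reduce a repertoire name to its comparison form (hypothetical helper).

    The result is meant ONLY for equality tests, never for display.
    """
    # (c) cut at the first colon, dropping the year/date suffix
    base = name.split(":", 1)[0]
    # (a) convert all letters to lowercase
    base = base.lower()
    # (b) remove all hyphens, underscores, and periods
    return re.sub(r"[-_.]", "", base)


def names_match(a: str, b: str) -> bool:
    """True if two names are equal after normalization (hypothetical helper)."""
    return normalize_for_comparison(a) == normalize_for_comparison(b)


if __name__ == "__main__":
    for variant in ("iso8859-1", "iso-8859-1", "iso_8859.1"):
        print(variant, names_match("ISO_8859-1:1987", variant))
```

    Doing step (c) first is equivalent to doing it last, since (a)
    and (b) never touch the colon.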

    Although the character set with the common alias "Latin 1" has been
    registered with a 'Name:' of "ISO_8859-1:1987" in the IANA Charset
    Registry, it is also VERY commonly referred to by existing software
    as "iso8859-1" or "iso-8859-1" or "iso_8859.1" (notice the typical
    misuse of periods and inconsistent presence of hyphen after "iso").

    It is highly desirable that IPP/PSI Printers/Clients behave like
    Web search engines and accept all approximate matches as equal.

    (2) For purposes of displaying supported character repertoires in
        the future "repertoire-supported" Printer object attribute,
        Printers MUST:
        (a) use a 'namespace' prefix from the PWG CR standard (such
            as "unihan") in all lowercase, followed by a hyphen;
        (b) use the best practice name of the base charset - for the
            "iana" prefix, this MUST be the registered 'Name:' value
            (complete with the year of standard suffix after a colon)
            and MUST NOT be any registered 'Alias:' value. However,
            this value MUST be normalized to lowercase, consistent
            with the existing 'charset-supported' Printer attribute
            semantics. And any embedded underscores MUST be changed
            to hyphens for consistency.
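
    A minimal Python sketch of the display form in (2), assuming the
    caller already supplies the registered 'Name:' value (the code
    cannot tell a 'Name:' from an 'Alias:'). The function name
    display_name is hypothetical.

```python
def display_name(namespace: str, registered_name: str) -> str:
    """Build the display form of a repertoire name (hypothetical helper).

    Assumes registered_name is the registry 'Name:' value, not an alias;
    this function does not (and cannot) verify that.
    """
    # (a) namespace prefix in all lowercase, followed by a hyphen
    prefix = namespace.lower()
    # (b) lowercase the name and change embedded underscores to hyphens;
    #     the colon and year suffix are kept for display
    base = registered_name.lower().replace("_", "-")
    return f"{prefix}-{base}"


if __name__ == "__main__":
    print(display_name("iana", "ISO_8859-1:1987"))
```

    Unlike the comparison form in (1), this keeps the hyphens,
    periods, and colon/date suffix.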

    I'd like to say it's OK to retain the colon/date info for the
    comparisons, but it's really not safe, practically speaking.

    Note that the existing "charset-supported" attribute says that
    Printers MUST use the 'Name:' value and MUST NOT use any of the
    'Alias:' values from the IANA Charset Registry.

    An interesting sidelight: The Printer MIB (RFC 1759) uses the enum
    tags that are 'Alias:' values beginning with "cs" (and containing
    NO punctuation characters at all, as recommended by SMIv2 for MIBs).
    When the Printer MIB is "visible" through the future PWG WBMM
    interface (and the new Printer Device in the PWG Semantic Model),
    we'll be faced with another interesting name collision. Sigh...

    Cheers,
    - Ira McDonald
      High North Inc

    -----Original Message-----
    From: ElliottBradshaw@oaktech.com [mailto:ElliottBradshaw@oaktech.com]
    Sent: Friday, March 21, 2003 11:50 AM
    To: McDonald, Ira
    Subject: Value matching in CR

    Hi Ira,

    I've been fiddling with the rules for matching CR values...in the last
    version I said that hyphens and underscores would be dropped before
    comparison. This may be a bit drastic...what if we say that a hyphen
    matches an underscore?

    Also, I think you said there was some reference we could use on the
    subject. True?

      Thanks,
      E.

    ------------------------------------------
    Elliott Bradshaw
    Director, Software Engineering
    Oak Technology Imaging Group
    781 638-7534


