Character Repertories Mail Archive: RE: CR> CR teleconferenc

RE: CR> CR teleconference and Implementor's Guide

From: ElliottBradshaw@oaktech.com
Date: Wed Jan 08 2003 - 13:33:00 EST


    See Rod's notes for some ideas on terminology.

    ------------------------------------------
    Elliott Bradshaw
    Director, Software Engineering
    Oak Technology Imaging Group
    781 638-7534

    ----- Forwarded by Elliott Bradshaw/oaktech/us on 01/08/2003 01:33 PM -----
                                                                                                        
    From:    "Acosta, Roderick" <Rod.Acosta@AgfaMonotype.com>
    To:      "'ElliottBradshaw@oaktech.com'" <ElliottBradshaw@oaktech.com>
    cc:
    Date:    01/08/2003 01:08 PM
    Subject: RE: CR> CR teleconference and Implementor's Guide
                                                                                                        

    Elliott,

    Some suggestions from a colleague of mine.

    Character set:
               Unicode is the default character set for HTML and XHTML.
               Valid Unicode values range from hexadecimal 0 to 10FFFF
               (decimal 0 to 1,114,111).
               Every valid Unicode character is associated with a code point
               in this range of scalar numbers.
               Unicode is an "ordered" character set because each character
               is represented by a unique scalar value.
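
    A minimal Python sketch of the range described above. The names
    MAX_CODE_POINT and is_in_unicode_range are made up for illustration;
    ord(), chr() and sys.maxunicode are standard Python.

```python
import sys

# Upper bound quoted in the text: hexadecimal 10FFFF.
MAX_CODE_POINT = 0x10FFFF


def is_in_unicode_range(value: int) -> bool:
    """Return True if value lies in the range 0..10FFFF (hypothetical helper)."""
    return 0 <= value <= MAX_CODE_POINT


if __name__ == "__main__":
    # The decimal form of the upper bound, as given in the text.
    print(MAX_CODE_POINT)
    # Python 3 exposes the same limit as sys.maxunicode.
    print(sys.maxunicode == MAX_CODE_POINT)
    # Each character has exactly one scalar value, and ord()/chr() convert
    # between the character and that number.
    print(ord("A"), chr(ord("A")))
    # Values past the upper bound are rejected.
    try:
        chr(MAX_CODE_POINT + 1)
    except ValueError as exc:
        print("rejected:", exc)
```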

    Transformations or Encodings:
               A Unicode scalar value can be expressed in a variety of digital
               forms, including UTF-8 and UTF-16. "UTF" stands for "Unicode
               Transformation Format".
               UTF-8 and UTF-16 are often called "encodings" because they
               represent ("encode") the full range of scalar values.
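
    To make the distinction concrete, the sketch below expresses one scalar
    value in both forms using Python's built-in codecs. The helper name
    show_encodings is hypothetical; "utf-16-be" is chosen only so that no
    byte order mark is added to the output.

```python
def show_encodings(ch: str) -> dict:
    """Return the scalar value of ch and its UTF-8 / UTF-16 byte sequences."""
    return {
        "scalar": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }


if __name__ == "__main__":
    # EURO SIGN: one scalar value, three bytes in UTF-8, two in UTF-16.
    print(show_encodings("\u20ac"))
    # MUSICAL SYMBOL G CLEF lies above U+FFFF, so UTF-16 needs a surrogate
    # pair (four bytes) and UTF-8 needs four bytes as well.
    print(show_encodings("\U0001D11E"))
```

    The scalar value stays the same in every case; only the byte sequence
    that carries it changes.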

    Unicode subset: What do we call it?
               The Unicode character set supports a large number of characters
               that are derived from other legacy character sets such as
               ISO 8859-x and JIS X 0208. With the exception of ISO 8859-1,
               all legacy characters must be mapped to their equivalent
               Unicode value through an algorithmic and/or table-driven
               process. The ordering of characters in a legacy character set
               is not necessarily replicated in Unicode.
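
    As an illustration of the mapping point, the sketch below uses Python's
    bundled codec tables (an assumption; any conversion table would serve).
    The helper legacy_to_unicode is hypothetical. It shows that ISO 8859-1
    byte values equal their Unicode values, while two adjacent ISO 8859-2
    bytes land in Unicode in the opposite order.

```python
def legacy_to_unicode(byte_value: int, codec: str) -> int:
    """Map one legacy single-byte code to its Unicode scalar value."""
    return ord(bytes([byte_value]).decode(codec))


# ISO 8859-1: every byte value is its own Unicode value, so no table is needed.
latin1_is_identity = all(
    legacy_to_unicode(b, "latin-1") == b for b in range(256)
)

# ISO 8859-2: bytes A2 and A3 are neighbours in the legacy set...
breve = legacy_to_unicode(0xA2, "iso8859_2")     # BREVE
l_stroke = legacy_to_unicode(0xA3, "iso8859_2")  # LATIN CAPITAL LETTER L WITH STROKE

if __name__ == "__main__":
    print(latin1_is_identity)
    # ...but their Unicode values are far apart and in reverse order.
    print(f"A2 -> U+{breve:04X}, A3 -> U+{l_stroke:04X}")
```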

    What does one call a subset of Unicode values that represent a range of
    characters from a common, legacy character set? We would like to propose
    the term "character collection" because
               a. it does not imply any particular ordering
               b. it does represent a closed, enumerable set
               c. it is distinct from "character set"
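
    One way to picture such a "character collection" is as an unordered,
    closed set of Unicode values. The sketch below is only an assumption
    about how it might be modelled: build_collection, is_covered and
    LATIN2_PRINTABLE are hypothetical names, and Python's iso8859_2 codec
    stands in for whatever mapping table an implementation really uses.

```python
def build_collection(codec: str, byte_values) -> frozenset:
    """Collect the Unicode values reachable from the given legacy byte codes."""
    points = set()
    for b in byte_values:
        try:
            points.add(ord(bytes([b]).decode(codec)))
        except UnicodeDecodeError:
            # Byte code not assigned in this legacy set: skip it.
            continue
    # frozenset: no ordering implied, closed and enumerable.
    return frozenset(points)


def is_covered(text: str, collection: frozenset) -> bool:
    """Return True if every character of text belongs to the collection."""
    return all(ord(ch) in collection for ch in text)


# Printable ASCII plus the upper half of ISO 8859-2, as a sample collection.
LATIN2_PRINTABLE = build_collection(
    "iso8859_2", list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
)

if __name__ == "__main__":
    print(is_covered("\u0141\u00f3d\u017a", LATIN2_PRINTABLE))  # Polish city name
    print(is_covered("\u20ac", LATIN2_PRINTABLE))               # EURO SIGN
```

    A device could advertise such a collection by name, and a client could
    test a document against it without regard to encoding.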

    /Rod

    -----Original Message-----
    From: ElliottBradshaw@oaktech.com [mailto:ElliottBradshaw@oaktech.com]
    Sent: Wednesday, January 08, 2003 9:41 AM
    To: Jun Fujisawa
    Cc: cr@pwg.org; owner-cr@pwg.org
    Subject: Re: CR> CR teleconference and Implementor's Guide

    Hello Fujisawa-san,

    Thanks for the useful information. I think we can get a lot of what we
    need from the Japanese Profile document.

    I am not entirely satisfied by the term "repertoire", and would like to
    have some discussion in the group. We are looking for a term that means
    "named subset of Unicode characters, without regard to encoding."
    Bluetooth uses "repertoire" in this way. Some other ideas:

      -character complement
      -Unicode Subset
      -CCSS (Coded Character SubSet)

    I'd like proposals for the term, as well as how we will actually define it.

    With regard to Shift-JIS, I now understand that there is no universal
    mapping from it to Unicode. And, many Japanese web pages still use
    Shift-JIS. So, we may want to recommend that a Japanese-capable printer
    support Shift-JIS as well as UTF-8, and that a Japanese-capable client use
    Shift-JIS if it is available. Otherwise the client must map to Unicode,
    and deal with the ambiguities of the different available mappings. I
    wonder how strongly we should follow Microsoft's lead in this area...
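
    For a concrete case of the mapping ambiguity, the sketch below decodes
    the same two Shift-JIS bytes with two tables that ship with Python:
    "shift_jis" (which follows the JIS X 0208 mapping) and "cp932"
    (Microsoft's variant). The choice of these two codecs and the helper
    name decode_with are assumptions made for illustration only.

```python
# The double-byte code 0x8160, one of the characters whose Unicode mapping
# differs between the tables.
WAVE_DASH_BYTES = b"\x81\x60"


def decode_with(codec: str) -> str:
    """Decode the sample bytes with the named Python codec."""
    return WAVE_DASH_BYTES.decode(codec)


jis_result = decode_with("shift_jis")  # JIS-based table
ms_result = decode_with("cp932")       # Microsoft table

if __name__ == "__main__":
    print(f"shift_jis: U+{ord(jis_result):04X}")
    print(f"cp932:     U+{ord(ms_result):04X}")
    print("same character?", jis_result == ms_result)
```

    A client that converts to Unicode has to pick one of these tables, and
    the printer's fonts must then contain the value that was picked.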

    ------------------------------------------
    Elliott Bradshaw
    Director, Software Engineering
    Oak Technology Imaging Group
    781 638-7534

    From:    Jun Fujisawa <fujisawa.jun@canon.co.jp>
    Sent by: owner-cr@pwg.org
    To:      ElliottBradshaw@oaktech.com
    cc:      cr@pwg.org
    Date:    01/06/2003 05:43 AM
    Subject: Re: CR> CR teleconference and Implementor's Guide

    Hello Elliott,

    At 2:16 PM -0500 03.1.3, ElliottBradshaw@oaktech.com wrote:
    >As our main topic I would like to go through the draft Implementor's
    >Guide, which I have placed at:
    >ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/CRimplementorsGuide.htm.

    I would like to point out that the terms "repertoire" and "character set"
    as defined in the Terminology section do not seem to be consistent with
    their usage in the W3C Character Model.

    For example, the use of the term "character set" is discouraged in
    Section 3.6.2 of Character Model for the World Wide Web 1.0.

    - Character Model for the World Wide Web 1.0
    <http://www.w3.org/TR/charmod/>

    >As before, my biggest challenge is finding online, normative material for
    >the details of the Asian character sets (except Korean, which is covered
    >in an RFC).

    Unfortunately, the only normative materials for the definitions of Japanese
    coded character sets (CCS) are the Japanese national standards.

    - JIS X 0201
    Japanese Industrial Standards Committee. 7-bit and 8-bit coded character
    sets for information interchange, JIS X 0201:1997, Japanese Standards
    Association, 1997.

    - JIS X 0208
    Japanese Industrial Standards Committee. 7-bit and 8-bit double byte coded
    KANJI sets for information interchange, JIS X 0208:1997, Japanese
    Standards Association, 1997.

    - JIS X 0212
    Japanese Industrial Standards Committee. Code of the supplementary Japanese
    graphic character set for information interchange, JIS X 0212:1990,
    Japanese Standards Association, 1990.

    - JIS X 0221
    Japanese Industrial Standards Committee. Universal Multiple-Octet Coded
    Character Set (UCS) -- Part 1: Architecture and Basic

    Also, I suggest consulting the following W3C Note for detailed information
    on some Japanese character encoding schemes (CES) and their mappings to
    Unicode.

    - XML Japanese Profile
    <http://www.w3.org/TR/japanese-xml/>

    --
    Jun Fujisawa
    <mailto:fujisawa.jun@canon.co.jp>
    

    (See attached file: att25f6a.dat)




    This archive was generated by hypermail 2b29 : Wed Jan 08 2003 - 13:34:23 EST