PMP> what happens if we use UTF-8 in place of ASCII

David_Kellerman at nls.com
Fri Jul 25 18:21:02 EDT 1997


> There are no 'levels' of UTF-8.  It is an encoding compatible
> with US-ASCII in the 'left-half' of any octet of the FULL
> UCS-2 Unicode set of 2-byte code points.  For Western European
> languages, it probably averages 2.5 bytes per character.
> For non-Western languages, it takes up to 4 bytes per character.


Ira, help me out here.  (I know just enough on this topic to be
dangerous.)  What I assumed for Latin languages was that in typical
text, 90-95% of the characters would be present in ASCII, and you'd only
use the multi-byte sequences for diacriticals and a handful of
non-English characters.  So typical expansion would be 10-20% over what
you'd get with a single-byte encoding like Latin-1.  What am I missing?
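
A rough way to check that estimate, as a minimal sketch in Python.  The
function name utf8_expansion and the synthetic sample strings are made
up for illustration; they are not real-world text.  The sketch relies on
one fact: every Latin-1 character above 0x7F takes exactly two bytes in
UTF-8, while ASCII characters stay at one byte.  Under that assumption,
text that is 90-95% ASCII should grow by roughly 5-10% relative to
Latin-1, so the 10-20% figure looks like a safe upper bound.

```python
def utf8_expansion(text: str) -> float:
    """Fractional size increase of UTF-8 over Latin-1 for the given text.

    Assumes every character in `text` is representable in Latin-1;
    encode() raises UnicodeEncodeError otherwise.
    """
    latin1_size = len(text.encode("latin-1"))  # always one byte per character
    utf8_size = len(text.encode("utf-8"))      # one byte for ASCII, two for U+0080..U+00FF
    return (utf8_size - latin1_size) / latin1_size


# Synthetic samples (illustrative only): 95% and 90% ASCII characters,
# with the remainder being e-acute (U+00E9).
mostly_ascii = "a" * 95 + "\u00e9" * 5
less_ascii = "a" * 90 + "\u00e9" * 10

print(f"95% ASCII: {utf8_expansion(mostly_ascii):.0%} larger in UTF-8")
print(f"90% ASCII: {utf8_expansion(less_ascii):.0%} larger in UTF-8")
```

Running the same function over a real document would give a better
number than the synthetic strings, since the share of accented
characters varies a good deal from language to language.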


David


::  David Kellerman         Northlake Software      503-228-3383
::  david_kellerman at nls.com Portland, Oregon        fax 503-228-5662


