PMP Mail Archive: Re: PMP> what happens if we use UTF-8 in place of ASCII

David_Kellerman@nls.com
Fri, 25 Jul 1997 14:21:02 PST

> There are no 'levels' of UTF-8. It is an encoding compatible
> with US-ASCII in the 'left-half' of any octet of the FULL
> UCS-2 Unicode set of 2-byte code points. For Western European
> languages, it probably averages 2.5 bytes per character.
> For non-Western languages, it takes up to 4 bytes per character.

Ira, help me out here. (I know just enough on this topic to be
dangerous.) What I assumed for Latin languages was that in typical
text, 90-95% of the characters would be present in ASCII, and you'd only
use the multi-byte sequences for diacriticals and a handful of
non-English characters. So typical expansion would be 10-20% over what
you'd get with a single-byte encoding like Latin-1. What am I missing?
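
A quick way to check this kind of estimate is to encode the same text both ways and compare sizes. The sketch below is only an illustration: the function name `utf8_expansion` and the French sample sentence are made up for the purpose, not taken from any PMP document. It relies on the fact that UTF-8 uses one byte for each ASCII character and two bytes for each character in the range U+0080 to U+07FF, which covers everything in Latin-1. Under that rule the growth over Latin-1 equals the fraction of non-ASCII characters, so text that is 90-95% ASCII would grow by roughly 5-10%, somewhat below the 10-20% guessed above.

```python
def utf8_expansion(text: str) -> float:
    """Return the fractional size increase of UTF-8 over Latin-1 for text.

    Assumes every character in text is representable in Latin-1;
    str.encode raises UnicodeEncodeError otherwise.
    """
    latin1_size = len(text.encode("latin-1"))  # always one byte per character
    utf8_size = len(text.encode("utf-8"))      # one byte for ASCII, two for U+0080..U+00FF
    return (utf8_size - latin1_size) / latin1_size


# Hypothetical sample, chosen only for illustration; real documents will vary.
sample = "Le café près de la forêt est fermé à midi."
print(f"UTF-8 is {utf8_expansion(sample):.1%} larger than Latin-1 for this sample")
```

A single short sentence says little about typical text, so a meaningful figure would need a representative corpus for each language of interest.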

David

:: David Kellerman Northlake Software 503-228-3383
:: david_kellerman@nls.com Portland, Oregon fax 503-228-5662