Command: iso10646 | Section: 5 | Source: Digital UNIX | File: iso10646.5.gz
Unicode(5) File Formats Manual Unicode(5)
NAME
Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, iso10646 - Sup-
port for the Unicode and ISO/IEC 10646 standards
DESCRIPTION
The operating system provides locales and codeset converters that
support the following standards:

The Unicode Standard, Version 2.1, Unicode, Inc., 1998
    See the Unicode Technical Report #8,
    http://www.unicode.org/unicode/reports/tr8.html, for detailed
    information about changes to this version of the standard.

Information Technology-Universal Multiple-Octet Coded Character Set,
ISO/IEC 10646:1993
    The Basic Multilingual Plane defined by this standard is identical
    with the main body of Unicode character encoding.

These standards define generalized character encoding rules that can be
applied to characters in most native language scripts. The Unicode
Standard specifies a universal character set (UCS) that contains
definitions in Version 2.1 for 38,887 characters and also includes a
Private Use Area for vendor- or user-defined characters. The following
list summarizes the main features of this character set:

  o  All characters are treated as 16-bit units.

  o  Each 16-bit unit has an abstract character identity.

  o  Certain sequences of 16-bit characters in a text stream are
     transformed into other characters, called composed characters.

  o  Characters have properties, such as base, numeric, spacing,
     combination, and directionality. The Unicode standard provides
     rules for ordering characters with different properties so that
     parsing of character sequences is unambiguous.

  o  The relationship between Unicode characters and the glyphs in the
     native language script that users see, type, or print is not
     necessarily one-to-one. A glyph may be mapped to a single abstract
     character or a composed character. Conversely, more than one glyph
     can be mapped to a character.

  o  The ISO 8859-1 character set occupies the first 256 code positions
     (and the ASCII character set the first 128 positions) of the UCS.

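
Two of the features above can be checked with a short Python sketch.
It is an illustration only: Python's unicodedata module follows a much
later Unicode version than 2.1, and NFC normalization is used here as a
stand-in for the composition of character sequences that the standard
describes. The variable names are hypothetical.

```python
import unicodedata

# Every ISO 8859-1 byte value decodes to the code point with the same number,
# so the character set occupies the first 256 code positions.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# The first 128 positions coincide with ASCII.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = ascii_text == latin1_text[:128]

# A base letter followed by a combining mark can be replaced by a single
# composed character (assumption: NFC stands in for the standard's rules).
base_plus_mark = "e\u0301"          # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", base_plus_mark)

print(latin1_matches, ascii_matches)                            # True True
print(len(base_plus_mark), len(composed), hex(ord(composed)))   # 2 1 0xe9
```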
The ISO/IEC 10646 standard specifies both 16- and 32-bit units for each
abstract character defined in the UCS. The 16-bit character values
in Unicode are zero-extended through a second 16-bit unit in the larger
encoding format. The second, or low-surrogate, 16-bit unit is reserved
for future use in both standards.
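
The zero extension described above can be sketched as follows. This is
an illustration only: the function names are hypothetical, and
big-endian byte order is an assumption, since this page does not state
one.

```python
def ucs2_to_ucs4(unit):
    """Zero-extend a 16-bit UCS-2 value to a 32-bit UCS-4 value."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit value: %r" % (unit,))
    # The numeric value is unchanged; only the storage width grows.
    return unit


def as_bytes(value, width):
    """Serialize a value in `width` octets (assumed big-endian)."""
    return value.to_bytes(width, "big")


euro = 0x20AC  # EURO SIGN
print(as_bytes(euro, 2).hex())                # 20ac
print(as_bytes(ucs2_to_ucs4(euro), 4).hex())  # 000020ac
```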
The Unicode and ISO/IEC 10646 standards specify a uniform character
size and allow character units to be processed for all languages by us-
ing the same set of rules. Therefore, system support for the universal
character set does not need to include multiple algorithms (one or more
per language) for converting between file code and internal process
code. However, the two different character sizes (16-bit or 32-bit)
that the standards support require different parsing schemes for data
input and output. Universal character encoding that an implementation
parses in 16-bit units (2 octets) is known as UCS-2. This is the
canonical Unicode encoding in wide use on PC systems. Universal charac-
ter encoding that an implementation parses in 32-bit units (4 octets)
is known as UCS-4. This is the canonical ISO/IEC 10646 encoding that is
in use on systems that can support the larger data unit size.
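
The need for different parsing schemes can be seen in a minimal sketch
that splits a byte stream into fixed-size units. The helper name
parse_units is hypothetical, and big-endian byte order is assumed.

```python
import struct


def parse_units(data, unit_size):
    """Split `data` into 2-octet (UCS-2) or 4-octet (UCS-4) units."""
    if unit_size not in (2, 4):
        raise ValueError("unit_size must be 2 or 4")
    if len(data) % unit_size:
        raise ValueError("data length is not a multiple of the unit size")
    code = "H" if unit_size == 2 else "I"
    # ">" selects big-endian; the count prefix repeats the unit code.
    return list(struct.unpack(">%d%s" % (len(data) // unit_size, code), data))


ucs2_data = b"\x00\x48\x00\x69"                   # "Hi" in 16-bit units
ucs4_data = b"\x00\x00\x00\x48\x00\x00\x00\x69"   # "Hi" in 32-bit units

print([hex(u) for u in parse_units(ucs2_data, 2)])  # ['0x48', '0x69']
print([hex(u) for u in parse_units(ucs4_data, 4)])  # ['0x48', '0x69']
# Reading UCS-2 data with the wrong unit size yields one meaningless value.
print([hex(u) for u in parse_units(ucs2_data, 4)])  # ['0x480069']
```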
The standards define four transformation formats for the universal
character set. For the most part, the following UCS transformation
formats (UTFs) exist to transform UCS values into sequences of bytes
for handling by various byte-oriented protocols:

  o  UTF-8, the standard method for transforming UCS-4 encoding into a
     sequence of 8-bit bytes and ensuring interchange transparency for
     characters in C0 code positions (0 to 31), the SPACE (32)
     character, and the DEL (127) character

  o  UTF-7, the standard interchange format for environments that strip
     the eighth bit from each byte

  o  UTF-1, which is similar to UTF-8 but also ensures interchange
     transparency of characters in C1 code positions (128 to 159)

  o  UTF-16, which handles the surrogate character extensions defined
     by Version 2.0 of the Unicode Standard. These extensions allow
     representation in 2-byte encoding units of characters whose values
     in UCS-4 are outside the range normally allowed by a 16-bit length
     restriction. When data includes these characters, the UTF-16
     transformation format enables data exchange between applications
     using UCS-4 and applications that require the data to be in UCS-2
     (2-byte) format. Although UTF-16 does not support representation
     of the entire UCS-4 code space, it supports all characters (except
     those in certain private-use ranges) that have been currently
     defined for the languages covered by both standards.

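
The UTF-8 transparency property and the UTF-16 surrogate mechanism
described above can be sketched in Python. This is an illustration
under stated assumptions: the helper utf16_units is hypothetical, and
Python's built-in codecs follow current definitions of UTF-8 and
UTF-16, which agree with the behavior shown here for these values.

```python
def utf16_units(code_point):
    """Return the 16-bit unit(s) that represent a UCS-4 value in UTF-16."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]               # fits in one 16-bit unit
    if code_point > 0x10FFFF:
        raise ValueError("outside the range UTF-16 can represent")
    offset = code_point - 0x10000         # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return [high, low]


# C0 positions (0 to 31), SPACE (32), and DEL (127) pass through UTF-8
# as the same single byte.
transparent = all(
    chr(c).encode("utf-8") == bytes([c]) for c in list(range(0, 33)) + [127]
)

print(transparent)                               # True
print("\u20ac".encode("utf-8").hex())            # e282ac
print([hex(u) for u in utf16_units(0x20AC)])     # ['0x20ac']
print([hex(u) for u in utf16_units(0x10000)])    # ['0xd800', '0xdc00']
```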
DIGITAL UNIX provides locales and codeset converters that support UCS-4
and UTF-8. The operating system supports UCS-2 only through codeset
converters, which transform data to UTF-16 format. The operating system
provides no support for the UTF-1 and UTF-7 transformation formats.
Codeset Conversion
Codeset converters are available to convert data in all the major
encoding formats that the operating system supports to and from UCS-2,
UCS-4, and UTF-8. If the worldwide support subsets are installed on
your system, you can enter the following commands to find the names of
these converters:

    % cd /usr/lib/nls/loc/iconv
    % ls | grep UTF
    % ls | grep UCS

Among the converters listed, you will find some that handle conversion
of data in the code-page format used on PC systems. See the
code_page(5) reference page for more information about converting be-
tween codeset and code-page formats. All codeset converters can be
used with the iconv command and associated library functions.
Note
The mapping of Korean Hangul characters changed between Version 1.1 and
Version 2.0 of the Unicode Standard. By default, UCS-2, UCS-4, and
UTF-8 conversion assumes Version 2.0 character mapping for Hangul
characters. Therefore, data in Version 1.1 format must be converted to
Version 2.0 format before it is converted from UCS-2, UCS-4, or UTF-8
to an entirely different format.

The format of a codeset converter name is from-codeset_to-codeset. In
converter names, the Version 1.1 codeset formats for UCS-2, UCS-4, and
UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and
UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are
represented by UCS-2, UCS-4, and UTF-8. For example, if Korean data is
currently in UCS-4 Version 1.1 format, the data must first be processed
by the UNICODE-1-1-UCS-4_UCS-4 converter before being processed by the
UCS-4_deckorean converter.
See the iconv_intro(5) reference page for general information on code-
set conversion.
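
The converter naming rule and the two-step Hangul conversion described
in the note above can be sketched as follows. The sketch only builds
converter names; it does not invoke iconv or check that a converter is
installed. The function names are hypothetical. Of the names it can
produce, only UNICODE-1-1-UCS-4_UCS-4 and UCS-4_deckorean appear on
this page; others are inferred from the naming rule and are assumptions.

```python
# Version 1.1 codeset names and their Version 2.0 counterparts,
# as listed on this page.
V11_TO_V20 = {
    "UNICODE-1-1": "UCS-2",
    "UNICODE-1-1-UCS-4": "UCS-4",
    "UNICODE-1-1-UTF-8": "UTF-8",
}


def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return "%s_%s" % (from_codeset, to_codeset)


def conversion_steps(from_codeset, to_codeset):
    """List the converter names to apply, in order."""
    steps = []
    if from_codeset in V11_TO_V20:
        # Version 1.1 data is first brought to Version 2.0 mapping.
        v20 = V11_TO_V20[from_codeset]
        steps.append(converter_name(from_codeset, v20))
        from_codeset = v20
    if from_codeset != to_codeset:
        steps.append(converter_name(from_codeset, to_codeset))
    return steps


print(conversion_steps("UNICODE-1-1-UCS-4", "deckorean"))
# ['UNICODE-1-1-UCS-4_UCS-4', 'UCS-4_deckorean']
```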
Locales
The following locales use UCS-4 as internal processing code:

universal.UTF-8
    This locale converts data in UTF-8 file format to UCS-4 process
    code. The locale can be used to test any UCS-4 character to
    determine if it is included in one of the following classes
    defined for the LC_CTYPE category: alnum, alpha, blank, cntrl,
    digit, graph, lower, print, punct, space, upper, or xdigit.

    In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY,
    LC_NUMERIC, and LC_TIME category definitions match those for the
    POSIX (C) locale.

native_locale_name@ucs4
    These locales (for example, fr_FR.ISO8859-1@ucs4) perform the
    same function as the universal.UTF-8 locale but are different in
    the following ways:

      o  The file code is specified by the codeset portion (for
         example, ISO8859-1) of native_locale_name.

      o  Classification information is not provided for the full set
         of UCS-4 characters, but only for those in a particular
         native language (for example, French). Country-specific data
         is also available to the application.

      o  The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and
         LC_TIME category definitions match those defined in
         native_locale_name.

language_territory.UTF-8
    These locales (for example, fr_FR.UTF-8) are similar to the
    @ucs4 locales in limiting classification information to the
    characters in a particular native language and making
    country-specific data available to the application. However, the
    locales assume file data follows UTF-8 encoding rules and are the
    only locales that support the Euro monetary character (C=).

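
The locale names above follow the pattern
language_territory.codeset@modifier. A minimal sketch of splitting a
name into those parts is shown below. The function name and the
dictionary keys are hypothetical, and the sketch does not consult the
system locale database.

```python
def parse_locale_name(name):
    """Split a locale name into language, territory, codeset, modifier."""
    base, _, modifier = name.partition("@")
    lang_terr, _, codeset = base.partition(".")
    language, _, territory = lang_terr.partition("_")
    return {
        "language": language,
        "territory": territory or None,
        "codeset": codeset or None,     # the file code
        "modifier": modifier or None,   # for example, ucs4
    }


print(parse_locale_name("fr_FR.ISO8859-1@ucs4"))
print(parse_locale_name("fr_FR.UTF-8"))
print(parse_locale_name("universal.UTF-8"))
```

For fr_FR.ISO8859-1@ucs4, the codeset part (ISO8859-1) is the file code
and the ucs4 modifier marks the locale as one of the @ucs4 variants.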
Note
CDE desktop users can select locales by choosing names followed by
(Unicode) from the CDE language menu at session startup. In this case,
the locale setting applies by default to all applications run during
the CDE session. However, users still using the DECwindows environment
can select locales only by setting a locale environment variable (LANG
or LC_ALL) from a terminal emulation window. When a locale is set in a
terminal emulation window, the locale setting applies only to child ap-
plications invoked from the parent window after the locale setting was
made.
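
The inheritance behavior described in this note, where a locale setting
reaches only the applications started afterward from the same
environment, can be sketched with a child process. This is an
illustration, not a description of CDE or DECwindows internals. The
function name run_with_locale is hypothetical, and the child here is
simply another Python interpreter that reports the LANG value it sees.

```python
import os
import subprocess
import sys


def run_with_locale(locale_name, argv):
    """Run `argv` as a child process with LANG set to `locale_name`."""
    env = dict(os.environ)       # copy; the parent's settings stay unchanged
    env["LANG"] = locale_name
    env.pop("LC_ALL", None)      # LC_ALL, if set, would override LANG
    result = subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


child = [sys.executable, "-c", "import os; print(os.environ['LANG'])"]
print(run_with_locale("fr_FR.UTF-8", child).strip())
```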
Unicode Character Database
For the convenience of programmers, the source file for the Unicode
character database (Version 2.1.5) is available online. This source
file is the one used to build the locales provided in optional software
subsets included with the operating system product. If the locales are
installed on your system, both the Unicode character database and an
associated ReadMe file are also installed in the /usr/share/unidata di-
rectory. The ReadMe file discusses the character properties supported
by Unicode.
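
Programmers who read the character database may find a parsing sketch
useful. The example below assumes the semicolon-separated, 15-field
record layout of the published UnicodeData file; the installed ReadMe
file should be checked to confirm the layout. The file name
under /usr/share/unidata is not given on this page, so the sketch
parses two sample records embedded in the code instead of reading a
file. The function name and dictionary keys are hypothetical.

```python
SAMPLE_RECORDS = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
    "LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
]


def parse_record(line):
    """Parse one record (assumed layout: 15 semicolon-separated fields)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))

    def hex_or_none(text):
        return int(text, 16) if text else None

    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],               # for example, Lu or Ll
        "combining_class": int(fields[3]),
        "bidi": fields[4],                   # directionality
        "decomposition": [int(p, 16) for p in fields[5].split()],
        "uppercase": hex_or_none(fields[12]),
        "lowercase": hex_or_none(fields[13]),
    }


for record in SAMPLE_RECORDS:
    print(parse_record(record))
```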
Font Support
The operating system provides the following types of bitmap fonts for
UCS characters:

  o  Public domain Unicode fonts:

       -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
       -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
       -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

  o  Composite fonts that the libfr_FGC font renderer creates by
     combining fonts available for other codesets

These fonts currently cover only a subset of the characters in UCS.
Each of the ETL public domain fonts supports about 1000 characters, but
does not include any characters for Chinese, Japanese, or Korean. The
composite fonts created by the font renderer are generated only from
fonts available for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9)
codesets.
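
The ETL font names listed above follow the X Logical Font Description
(XLFD) layout of 14 hyphen-separated fields. A minimal sketch of
reading those fields is shown below. The function name and field labels
are hypothetical, and the sketch handles only fully specified names
without wildcards.

```python
XLFD_FIELDS = (
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "resolution_x", "resolution_y",
    "spacing", "average_width", "registry", "encoding",
)


def parse_xlfd(name):
    """Split a fully specified XLFD font name into its 14 fields."""
    if not name.startswith("-"):
        raise ValueError("an XLFD name starts with a hyphen")
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected 14 fields, got %d" % len(parts))
    return dict(zip(XLFD_FIELDS, parts))


font = parse_xlfd("-etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1")
print(font["foundry"], font["family"], font["pixel_size"])   # etl fixed 14
print(font["registry"] + "-" + font["encoding"])             # iso10646-1
# point_size is expressed in tenths of a point.
print(int(font["point_size"]) / 10)                          # 14.0
```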
Refer to iso8859-1(5) and iso8859-15(5) for the names of fonts avail-
able for Latin-1 and Latin-9 characters. Note that the Latin-9 fonts,
which include glyphs for the Euro character, provide the best support
for the language_territory.UTF-8 locales, which also support this char-
acter.
For information on printer support and converting bitmap font encoding
to PostScript, see i18n_printing(5) and wwpsof(8).
SEE ALSO
Commands: locale(1), wwpsof(8)
Others: ascii(5), code_page(5), iso8859-1(5), i18n_intro(5),
i18n_printing(5), iconv_intro(5), l10n_intro(5)
Unicode(5)