This Annex to the Guide to the Use of Character Sets in Europe provides more detailed information about the Universal Multiple-Octet Coded Character Set (UCS) specified in ISO/IEC 10646-1 than is found in the main body of the Guide. Annex A deals in more detail with 8-bit character set standards.
Table of Contents
1 Introduction
2 Nature of character data
2.3 Alphabetic, syllabic and ideographic scripts
2.4 Sequence order and writing mode
2.5 Precomposed and decomposed characters
3 Coding of character data
3.3 Limitations of two-octet codes
3.4 The four-octet structure of the UCS
4 Basic Multilingual Plane (BMP)
4.3 Alphabetic and syllabic scripts of the A-zone
4.4 Unified ideographs of the I-zone
4.5 The Hangul syllabics of the O-zone and Yi
4.6 The restricted use R-zone
5 Visual representation of characters
5.3 Use of multiple combining characters
6 Referencing of characters
6.3 Linguistic translation of character names
6.4 Unique identifiers for characters
6.5 Unique identifiers for glyphs
7 UCS – Repertoires and subsets
7.3 Collections and subsets
7.4 Significance of subsets for conformance to the UCS
7.5 Subsets as an aid to migration from 8-bit codes
8 UCS – Coding methods of the UCS
8.3 UCS-4: Four-octet canonical form
8.4 UTF-16: UCS Transformation format 16
8.5 UTF-8: UCS Transformation format 8
9 UCS – Serial transmission of the UCS
10 UCS – Use of control functions with the UCS
10.3 The use of control functions with the UCS
10.4 Identification of UCS subsets by use of control functions
10.5 Invocation of the UCS from an 8-bit code
1.1 Origins and aims of the UCS
The Universal Multiple-Octet Coded Character Set, more simply known as the UCS, is intended to provide a single coded character set for the encoding of the written forms of all the languages of the world and of a wide range of additional symbols that may be used in conjunction with such languages. It is intended not only to cover languages in current use, but also languages of the past and such additions as may be required in the future.
The coding provided by the UCS is applicable to the representation, transmission, interchange, processing, storage, input and presentation of the written forms of the languages.
To achieve these aims, the UCS is a multi-part standard under continuous development. The first edition of part 1 was published in 1993 as:
- ISO/IEC 10646-1:1993, Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane.
At the time of writing, two Technical Corrigenda and Amendments 1 to 9 (Cor.1-2, AMD.1-9) have been published. Amendments 10 to 27 are in preparation. This guide covers both the base standard and the latest available texts of all these corrigenda and amendments.
The Basic Multilingual Plane (BMP) referred to in this title is a subset of the full UCS that may be encoded in 16 bits, so providing for a total of 65,536 character positions of which so far a large proportion have been allocated. The full UCS allows for 31-bit coding (there is a 32nd bit that is constrained to be zero) and so provides for over two thousand million characters. It should therefore have ample space to fulfill its intention of covering all languages.
For many applications of the UCS, the characters of the BMP are all that will be required. It would be very wasteful of resources if a 32-bit coding was imposed on applications that required only a subset that could be encoded in 16 bits. The UCS therefore specifies more than one form of coding for its characters, in particular providing for encoding of the BMP in a 16-bit form.
The UCS standard will be extended in future by the publication of further parts and of further editions of the existing part 1. Future editions incorporate all published corrigenda and amendments issued prior to their publication. They may in addition include further changes that have not been published separately in this way. It is the declared intention that all such extensions of the UCS will be upwardly compatible, i.e. that they will add the coding of additional characters but that once included, no character will be withdrawn or have its coding changed. The scope of the standard is, however, so wide that such an intention is difficult to maintain. It has, indeed, already been broken in published corrigenda and amendments. Nevertheless it is hoped that it will not be necessary in future to make any further exceptions to this important feature.
The UCS has been developed under the auspices of Joint Technical Committee 1 (JTC 1) of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). ISO maintains a World Wide Web site, which includes its catalogue and ordering information, at the URL http://www.iso.ch.
The JTC 1 subcommittee responsible for the UCS is SC 2, which maintains an official information service at the URL http://www.dkuug.dk/JTC1/SC2.
The UCS is closely related to a commercial character encoding called UNICODE™, prepared by The Unicode Consortium and published as The Unicode Standard, Worldwide Character Encoding, which is now at Version 2.1. Information concerning UNICODE™ is available at the URL http://www.unicode.org.
Roughly speaking, UNICODE™ can be regarded as being the 16-bit coding of the BMP of the UCS. There is effective cooperation between the Unicode Consortium and ISO/IEC JTC 1/SC 2 which should ensure that this compatibility is maintained in future enhancements to the BMP. However, UNICODE™ is not simply the BMP of the UCS as it includes guidelines for usage that are not present in the equivalent ISO standard.
The restriction of UNICODE™ to containing only the BMP of the UCS increases the significance of the positioning of characters in future additions to the UCS. More details of the organization of the BMP are given in section 4 of this guide.
2.1 Characters, character names and glyphs
To understand the role of the UCS in the electronic representation of character data, we first need to consider what is meant, in this context, by a character. The instinctive view of a character, which must be our starting point, is that it is the basic element of some writing system, such as a letter of an alphabet or an ideograph of an ideographic writing system. But this view needs refinement in the context of such an ambitious project as the coding of all the languages of the world.
Characters are identified in their written form by their shape, which is an imprecise concept arising from the ability of the human brain to recognize that two distinct and non-identical objects have the same “shape”. It is this ability that enables us to read handwriting, different typefaces, etc. It is a learned ability; most Western people have difficulty in telling whether two similar written Chinese ideographs are in fact “the same character”. But it exists and we have to accept that there is an abstract concept of “shape” that underlies the entire nature of written language.
Subtleties enter when we realize that there is context dependence to the recognition of written characters. There are letters with the same shape in the Latin and Greek alphabets, for example, but we do not think of them as the same character. The shape for a Latin capital letter A is recognized as a Greek capital letter alpha when it appears in Greek text. A hyphen is interpreted as a minus sign when it appears in mathematical expressions. Are Greek capital letter omega (Ω) and the Ohm sign (symbol for electrical unit of resistance) the same character or not? Historically the Greek letter was adopted as the Ohm sign, but it is a question of opinion as to whether it has by usage now become a symbol in its own right. The viewpoint of the UCS is that they are now distinct characters.
There are also subtleties in the opposite direction. The Greek language uses two distinct written forms for the Greek small letter sigma, depending on whether it is (ς), or is not (σ), the final letter of a word. Printed text often makes use of ligatures (joined letters) for reasons of appearance that have no linguistic basis. For example, printed text in the Latin alphabet often combines a small letter F followed by a small letter I into a ligature:
f + i = ﬁ
This creates a recognizably distinct shape but it is interpreted as two distinct letters when it is read. These are examples where the shape that represents the character or characters is affected by the context in which the character appears.
Which of these subtleties is important, for the purposes of the electronic encoding of data, depends substantially on the use to which the coding is to be put. A particular application of encoded data is normally concerned either with the visual appearance of encoded symbols, e.g. for printing applications, or with the semantics of the encoded symbols, e.g. for data processing. This has given rise to two distinct concepts arising from our first idea of a character as the basic element of some writing system. Elements of written data that are distinguished from one another by visual appearance are known as glyphs. The term character has become specialized to mean elements of written data that are distinguished from one another by semantic interpretation. The formal definitions are as follows:
- character: A member of a set of elements used for the organisation, control, or representation of data (taken from ISO/IEC 10646-1:1993).
- glyph: A recognizable abstract graphic symbol which is independent of any specific design (taken from ISO/IEC 9541-1:1991).
Characters are distinguished from one another by name, not by form or shape. ISO standards for coded character sets normally include tables that show a representative printed form for each character represented. These printed forms are purely illustrative and are not necessarily distinctive; the same shape (glyph) may be used for more than one character in a table. It is the name, such as LATIN CAPITAL LETTER A, that identifies the character being encoded in each code position. It is a convention adopted by the UCS that the names of characters are composed only from Latin capital letters A to Z, digits 0 to 9, space and hyphen. There are restrictions on the use of digits in names; in particular, they may only be used in the names of ideographic characters.
With this distinction in place, we can say that the UCS is a standard that specifies an encoding of characters. The standard shows a representative printed form (glyph image) for each encoded character, but these are not all distinct from one another.
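As an informal illustration, which is not part of the standard, the naming convention can be checked with the Python standard library module unicodedata. That module exposes the character names of the Unicode character database; it is assumed here that these coincide with the UCS names for the characters shown. The helper name_uses_ucs_alphabet is hypothetical and exists only for this sketch.

```python
import re
import unicodedata

# Names may use only Latin capital letters, digits, space and hyphen.
NAME_PATTERN = re.compile(r"[A-Z0-9 -]+")

def name_uses_ucs_alphabet(ch: str) -> bool:
    """Return True if the character's name uses only the permitted symbols."""
    return NAME_PATTERN.fullmatch(unicodedata.name(ch)) is not None

# Two characters that may share a glyph but are distinguished by name.
omega = "\u03a9"
ohm = "\u2126"

print(unicodedata.name(omega))        # GREEK CAPITAL LETTER OMEGA
print(unicodedata.name(ohm))          # OHM SIGN
print(omega == ohm)                   # False: distinct characters
print(name_uses_ucs_alphabet("A"))    # True
```

The two code positions compare as different even though a typical font draws them with the same shape, which is the sense in which characters are identified by name and not by glyph.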
2.2 Graphic characters and control characters
The characters described in the preceding section are all graphic characters, i.e. characters that have a visual representation. Character data also includes characters present for control purposes, such as CARRIAGE RETURN or LINE FEED. These particular control characters have names that originate with the use of electromechanical teleprinters, but they are still used today for the characters used to control paragraph separation in modern text processing systems. They are just two examples of many such non-printing characters that may be required to control the systems used for the display or printing of coded character data.
When data is encoded directly as a sequence of characters, such control characters will appear interspersed in the sequence of graphic characters. They must therefore be assigned code positions along with the graphic characters of the code. Nowadays character data is often transmitted or otherwise processed by means of protocols that separate the control data from the character data. One such protocol is Abstract Syntax Notation One (ASN.1). When such protocols are used, it is not necessary to keep code positions for control characters within the code used for graphic characters as the separation is achieved by other means. However, the UCS does reserve code positions for the use of control characters, to permit use in systems where a single sequence of intermixed graphic and control characters is required.
2.3 Alphabetic, syllabic and ideographic scripts
The world’s languages whose characters are encoded in the UCS differ substantially from one another in the extent to which the written forms of the languages can be broken down into constituent elements. The scripts used for written languages fall, for this purpose, into three distinct classes:
- alphabetic scripts, in which the number of distinct letters used in writing is limited and at most a few hundred;
- syllabic scripts, where written symbols each represent a language syllable, in which the number of distinct syllables is limited but may run into several thousand; and
- ideographic scripts, where there is in principle no limit on the number of different ideographs that may be used in writing, other than that imposed by the vocabulary of the language.
In these descriptions the meaning of “limited” is that the number will not increase as further words are added to the language, either in the future or to cover language usage in the past.
These different classes of script have very different requirements in terms of the number of code positions required to represent them in a coded character set. All the alphabetic scripts of the world, taken together, require fewer code positions than does the Chinese ideographic script on its own. The UCS sets aside somewhat over one quarter of its code space in the BMP, a total of 20992 code positions, for the East Asian ideographic scripts of the Chinese, Japanese and Korean languages taken together. A further 11172 code positions are occupied by the Korean Hangul syllabic script. This leaves somewhat over one half of the code space of the BMP for all other scripts of all the other languages of the world that are in current use. This is likely to be more than adequate. The effect of the limitation of space on the encoding of the ideographic scripts is described in section 4.4.
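The proportions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are chosen for illustration.

```python
BMP_SIZE = 2 ** 16     # 65536 code positions in the BMP
IDEOGRAPHS = 20992     # East Asian ideographic scripts
HANGUL = 11172         # Korean Hangul syllabic script

remaining = BMP_SIZE - IDEOGRAPHS - HANGUL

print(f"ideographs: {IDEOGRAPHS / BMP_SIZE:.1%}")
print(f"Hangul:     {HANGUL / BMP_SIZE:.1%}")
print(f"remaining:  {remaining} positions ({remaining / BMP_SIZE:.1%})")
```

The remainder comes to 33372 positions, which is a little more than half of the plane.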
2.4 Sequence order and writing mode
The written form of a language is composed of a sequence of script elements. This is true whether the script is alphabetic, syllabic or ideographic. But languages differ from one another in the arrangement of the sequence on paper (or other writing surface). Three arrangements are in common use. The succession of script elements may be written left-to-right (e.g. Latin, Cyrillic and Greek scripts and horizontal Japanese Kanji) or right-to-left (e.g. Hebrew and Arabic scripts), with successive rows being written top-to-bottom, or the script elements may be written top-to-bottom (e.g. vertical Japanese Kanji) with successive rows being written right-to-left.
The sequence order of the characters in an encoding of any script is that of the logical succession of characters, regardless of the writing mode. If the encoding is to be used to create a written presentation of the encoded material, it is up to the application to observe the correct writing mode for the script in use. This is so even for encoded data that intermixes two or more scripts with different writing modes, e.g. text in Latin script containing Hebrew quotations. Where it is required to encode the intended writing mode along with the character data, the control functions SELECT PRESENTATION DIRECTIONS and START REVERSED STRING may be used. Their coding, which makes use of control characters, is defined in ISO/IEC 6429:1992. The first of these functions is used to set the writing mode of the main text. The second is used to reverse the direction temporarily, as in the example of Hebrew quotations within a predominantly Latin script.
Certain characters have semantics that depend on writing direction. The symbols “(” and “>” represent an opening parenthesis and a greater-than sign when they occur in a script written from left to right, but in a script written from right to left they represent a closing parenthesis and a less-than sign respectively. There are provisions within ISO/IEC 10646-1 for such characters to be presented in mirrored form, “)” and “<” in this example, when used with a script written from right to left. However, such mirroring should not be performed automatically since there are separate characters which have these glyphs as their normal form. Specific rules governing such forms of presentation are given in annexes C and D of ISO/IEC 10646-1.
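The following sketch illustrates logical sequence order and direction-dependent characters. It relies on the bidirectional and mirrored properties of the Unicode character database as exposed by the Python module unicodedata; these properties are an assumption of this example and are not taken from the annexes of ISO/IEC 10646-1 mentioned above.

```python
import unicodedata

# Latin text containing a Hebrew word, stored in logical (reading) order.
text = "abc \u05d0\u05d1\u05d2 (x)"

for ch in text:
    if ch.isspace():
        continue
    direction = unicodedata.bidirectional(ch)   # e.g. "L" or "R"
    mirrored = unicodedata.mirrored(ch)         # 1 if mirrored in right-to-left text
    print(f"U+{ord(ch):04X} direction={direction} mirrored={mirrored}")
```

The Hebrew letters appear in the string in the order in which they are read, even though they would be displayed from right to left; arranging them on the page is left to the presentation process.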
2.5 Precomposed and decomposed characters
Even within alphabetic scripts, there is ambiguity as to what are the constituent elements of the script. Many scripts use diacritical marks, such as accents and tone marks, as modifiers of basic letters. At what point does one cease the decomposition? Is ê (e circumflex) a letter in its own right or a composite of two separate elements, a letter (e) and an accent (circumflex)? If it is to be regarded as a composite on the grounds that the letter and accent are separated from one another, then what about i (small letter I)? A dotless i is a letter in its own right in the Turkish language. And what about ø (o with stroke), which is a superposition that is not visually in two distinct parts?
The UCS has adopted the view that basic letters and diacritical marks should be assigned encodings as separate graphic characters, but that the composites that are in normal use in current languages should also be encoded as graphic characters in their own right. A character such as a diacritical mark, intended only to be used in conjunction with a base letter, is said in the UCS to be a combining character. A composite formed from a base letter and one or more combining characters is called a composite sequence.
A composite sequence is not a character, as it is not a member of the set of elements that form the UCS; it is a sequence of such elements. But both graphic characters and composite sequences have visual representations as glyphs, and the same glyph may be the visual representation both of a graphic character of the UCS and of a composite sequence. The glyph é (e acute) is the visual representation both of the character LATIN SMALL LETTER E WITH ACUTE and the composite sequence LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
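The e acute example can be reproduced with the Python module unicodedata. This is a sketch only: the normalization forms NFC and NFD used at the end are defined by the Unicode Standard, not by the text discussed here, and are included merely to show that software can convert between the two representations.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
composite = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

print([unicodedata.name(c) for c in precomposed])
print([unicodedata.name(c) for c in composite])

# The two encodings differ even though the glyph is the same.
print(precomposed == composite)   # False

# A non-zero combining class marks a combining character.
print(unicodedata.combining("\u0301") != 0)   # True

# Conversion between the two forms (a Unicode facility, see above).
print(unicodedata.normalize("NFC", composite) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == composite)   # True
```

An application that compares text must therefore decide whether the two forms are to be treated as equal; a direct comparison of code sequences treats them as different.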
3.1 Fixed and variable length codes
A coded character set provides an unambiguous relationship between the characters of a specified set and sequences of binary digits (bits) that are used to represent these characters. One of the best known coded character sets is ASCII, the American Standard Code for Information Interchange. This represents 128 characters and uses all possible combinations of 7 bits. But there is no reason other than convenience why each character should be represented by the same number of bits, provided that the structure of the code permits the boundaries between one character coding and the next to be distinguished.
Modern codes used for the interchange and processing of information encode each character by one or more octets, an octet being a sequence of 8 bits. A code is fixed length if each character of the code is represented by the same number of octets and is of variable length if this is not the case. One standardized code of variable length for the Latin script is that specified in
- ISO/IEC 6937, Information technology – Coded graphic character set for text communication – Latin alphabet.
In this code, letters without diacritical marks, and other symbols, are encoded by a single octet. Letters with diacritical marks are considered to be characters in their own right but they are encoded by a sequence of two octets. The first octet identifies the diacritical mark and the second identifies the base letter. This ordering originates with electromechanical printing equipment in which diacritical marks are non-spacing characters. The following character is then superposed on the diacritical mark to form an accented character.
The use of non-spacing diacritical marks in the variable length encoding of ISO/IEC 6937 differs in principle from the use of combining characters in the UCS, but the practical effect is similar. In ISO/IEC 6937 the non-spacing diacritical marks are not considered to be characters in their own right. The octet that represents a particular non-spacing diacritical mark is not a valid encoding on its own; instead it carries an implicit signal that it is to be followed by a second octet in order to complete the encoding.
In contrast to ISO/IEC 6937, the combining characters of the UCS are characters in their own right but they do not have a visual representation on their own. A glyph, giving a visual representation, is only associated with a complete composite sequence in which combining characters form only a part. A code with such a structure is capable of representing more glyphs than there are characters in the code.
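The difference in ordering between the two schemes can be sketched as a small conversion routine. The octet values in NON_SPACING are assumptions made for illustration and should be checked against ISO/IEC 6937 before any real use; the function name to_composite_sequences is hypothetical, and all other octets are assumed to lie in the ASCII range.

```python
# Assumed excerpt: non-spacing diacritical octet -> UCS combining character.
NON_SPACING = {
    0xC1: "\u0300",  # grave accent -> COMBINING GRAVE ACCENT
    0xC2: "\u0301",  # acute accent -> COMBINING ACUTE ACCENT
}

def to_composite_sequences(octets: bytes) -> str:
    """Convert a 6937-style octet string to base letter + combining character."""
    out = []
    i = 0
    while i < len(octets):
        octet = octets[i]
        if octet in NON_SPACING:
            # The diacritical octet is not valid on its own.
            if i + 1 >= len(octets):
                raise ValueError("diacritical octet without a following base letter")
            base = chr(octets[i + 1])
            # UCS order: base letter first, combining character second.
            out.append(base + NON_SPACING[octet])
            i += 2
        else:
            out.append(chr(octet))
            i += 1
    return "".join(out)

result = to_composite_sequences(b"caf\xc2e")
print([f"U+{ord(ch):04X}" for ch in result])
```

In the input the accent octet precedes the letter e; in the output the letter e comes first and is followed by the combining character.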
3.2 Inadequacy of single-octet codes
The ASCII 7-bit code reserved the first 32 of its 128 code positions for control characters. Of the remaining 96 positions, one was used for the SPACE character and another for a DELETE character, so only 94 positions were left for graphic characters.
Due to the influence of ASCII on the development of coded character sets, early 8-bit codes were structured as two 7-bit codes, conceptually with a left-hand and a right-hand code table distinguished from one another by the eighth bit. Each of the 7-bit codes had the first 32 code positions reserved for control characters, but SPACE and DELETE were not required a second time, leaving 96 positions in the right-hand code table for graphic characters.
Such an 8-bit code is very limiting, as even with the use of combining characters it does not contain sufficient space for the base letters of more than one or two alphabetic scripts. A number of single-octet codes with this construction are defined in the multi-part standard ISO/IEC 8859 and further parts are still under development. Each part contains the specification of a single-octet code that includes the graphic characters of ASCII and which makes no use of combining characters. Although an improvement on the 7-bit ASCII code, each part covers the characters required for only a small selection of the world’s languages.
3.3 Limitations of two-octet codes
As there is no possibility of using a single-octet code for an ideographic script, the 8-bit code structure with its ASCII inheritance was extended in the simplest manner that would permit the coding of such scripts. A sequence of two or more octets, each corresponding to a code position for a graphic character from the same half (left-hand or right-hand) of the 8-bit code table, could be taken together to encode a character. A sequence of two octets would then permit 94 x 94 (i.e. 8836), or 96 x 96 (i.e. 9216), characters to be encoded instead of the 94 or 96 permitted by single octet coding. The structure for such codes, together with various code extension techniques, is specified in
- ISO/IEC 2022, Information technology – Character code structure and extension techniques
This provides sufficient space for the encoding of the most commonly used Chinese characters in a single code for use in either the left-hand or right-hand area of such a code structure. The full requirements of Chinese, however, are well illustrated by the Chinese Standard Interchange Code CNS 11643 (1992). This defines 7 such character sets which between them provide for the coding of 48027 Chinese characters. The code extension techniques of ISO/IEC 2022 include code switching mechanisms which permit all such tables to be used in conjunction with one another.
By breaking away from the ISO/IEC 2022 code structure, the full 65536 code positions of a two-octet code become available. There is still the need to provide for the coding of control characters, but this is minimal in comparison with the available space. However, if 48027 of these code positions are required for Chinese characters, a single two-octet code becomes inadequate to cover all the languages of the world.
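The figures in this section follow from simple arithmetic, sketched below with the values quoted in the text; the variable names are chosen for illustration.

```python
GRAPHIC_94 = 94          # graphic positions in a 94-character set
GRAPHIC_96 = 96          # graphic positions in a 96-character set
CNS_11643_1992 = 48027   # Chinese characters coded by CNS 11643 (1992)
TWO_OCTETS = 2 ** 16     # all positions of an unconstrained two-octet code

print(GRAPHIC_94 ** 2)               # 8836
print(GRAPHIC_96 ** 2)               # 9216
print(TWO_OCTETS)                    # 65536
print(TWO_OCTETS - CNS_11643_1992)   # 17509 positions left for everything else
```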
3.4 The four-octet structure of the UCS
Since the intention of the UCS is to have the capability of covering all characters of all languages, a four-octet structure has been adopted. The most significant bit of the most significant octet is constrained to be zero, which permits its use for private internal purposes in a data processing system. The remaining 31 bits allow for over two thousand million code positions, which should be more than enough for all future needs. For reference the four octets are named, in order from the most significant to the least significant,
- the Group-octet, G-octet or simply G;
- the Plane-octet, P-octet or simply P;
- the Row-octet, R-octet or simply R;
- the Cell-octet, C-octet or simply C.
The entire code space is correspondingly viewed as a four-dimensional structure composed of
- 128 groups, each specified by a value for G;
- 256 planes in each group, each plane specified by a value for P;
- 256 rows in each plane, each row specified by a value for R;
- 256 cells in each row, each cell specified by a value for C.
The values of any octet are specified by two hexadecimal digits 0-9, A-F, in which A through F correspond to the decimal values 10 through 15 respectively. The value of G is restricted to the range 00-7F.
A cell within a plane may be described by four hexadecimal digits giving its R and C values, thus 1234 corresponds to R=12, C=34. In every plane the cells FFFE and FFFF are left unused; FFFE has a special use in signatures for coding identification (see the chapter on serial transmission of the UCS) and FFFF is available whenever a value is required that is guaranteed not to be a valid character code.
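The four-octet structure can be sketched as follows. The names CodePosition, split_ucs4 and is_unused_cell are hypothetical and are introduced only for this example.

```python
from typing import NamedTuple

class CodePosition(NamedTuple):
    g: int  # Group-octet
    p: int  # Plane-octet
    r: int  # Row-octet
    c: int  # Cell-octet

def split_ucs4(value: int) -> CodePosition:
    """Split a four-octet UCS value into its G, P, R and C octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("the most significant bit must be zero")
    return CodePosition(value >> 24, (value >> 16) & 0xFF,
                        (value >> 8) & 0xFF, value & 0xFF)

def is_unused_cell(value: int) -> bool:
    """True for cells FFFE and FFFF, which are left unused in every plane."""
    return (value & 0xFFFF) in (0xFFFE, 0xFFFF)

pos = split_ucs4(0x00001234)
print(f"G={pos.g:02X} P={pos.p:02X} R={pos.r:02X} C={pos.c:02X}")

# 128 groups x 256 planes x 256 rows x 256 cells = 2**31 code positions
print(128 * 256 * 256 * 256 == 2 ** 31)   # True
```

The value 00001234 yields G=00, P=00, R=12, C=34, matching the example in the text.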
The plane with G = 00, P = 00 is known as the Basic Multilingual Plane (BMP). The 34 planes P = 0F, 10, E0-FF in Group 00 and the whole of the 32 groups G = 60-7F are designated as available for private use, outside the scope of standardization. Planes P = 0F, 10 in Group 00 were added by Amendment 1 to those reserved for private use, to enable two private use planes to be accessed by the UTF-16 coding methods specified in that Amendment. There is also a block reserved for private use that lies within the BMP; see section 4.
The UCS is one of two major codes that have developed outside of the constraints of ISO/IEC 2022. The other is the commercially-developed UNICODE™, which was developed strictly as a two-octet code. In the interests of compatibility, ISO and the Unicode Consortium cooperated during the development of both codes to ensure that the BMP coincides with the UNICODE™ code table. In addition the UCS standard specifies more than one form for the coded representation of characters. One of these is the two-octet BMP form which, as its name implies, provides for the encoding of the characters of the BMP by the values for their R and C octets alone. This coding is identical to that provided by UNICODE™.
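The relationship between the two-octet BMP form and the four-octet form can be sketched as below. The function names are hypothetical, and the octets are written most significant first, which is an assumption of this example; the order of octets in serial transmission is treated elsewhere in this guide.

```python
def to_bmp_form(text: str) -> bytes:
    """Encode each character by its R and C octets alone."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError(f"U+{code:04X} lies outside the BMP")
        out += bytes([code >> 8, code & 0xFF])
    return bytes(out)

def to_four_octet_form(text: str) -> bytes:
    """Encode each character by its G, P, R and C octets."""
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)

sample = "A\u03a9"   # LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER OMEGA
print(to_bmp_form(sample).hex(" "))          # 00 41 03 a9
print(to_four_octet_form(sample).hex(" "))   # 00 00 00 41 00 00 03 a9
```

For characters of the BMP the four-octet form simply prefixes G=00 and P=00, so the two-octet form halves the storage required.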
The pressure to ensure that all languages in current use can be represented by UNICODE™, so requiring coding within the BMP, has led to compromises in design that would not have been necessary in a pure four-octet code. Once again it is the space required by the ideographic scripts that is the source of the difficulties. This is explained in more detail in section 4.
4 Basic Multilingual Plane (BMP)
4.1 Relationship to 8-bit codes
For reasons of compatibility, the row of the BMP with R = 00 has been given the structure of an 8-bit code according to ISO/IEC 2022. This requires that
- the code positions 0000-001F and 0080-009F are reserved for the coding of control functions (prior to Amendment 3, only 0000-001F was available for the coding of control functions as 0080-009F was reserved for future standardization);
- the code position 007F is reserved for the DELETE character (for historic reasons that have long since ceased to be relevant);
- the code position 0020 is allocated to the SPACE character.
This enables the coded representation of a control function to be obtained by a simple algorithm from its coded representation in an 8-bit code in accordance with ISO/IEC 2022. The algorithm is described elsewhere in this annex.
The graphic characters in the remaining 190 code positions of row 00 are allocated in accordance with the 8-bit code specified in
- ISO/IEC 8859-1:1997, Information technology – 8-bit single-byte coded graphic character sets – Part 1: Latin Alphabet No.1.
That code, and therefore row 00 of the BMP, contains graphic characters used for general purpose applications in typical office environments in at least the following languages:
- Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Faroese, Finnish, French (with restrictions), Frisian, Gaelic, Galician, German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Spanish and Swedish.
This incorporation of ISO/IEC 8859-1 in particular makes the cells 21-7E of row 00 have the same allocations as the graphic characters of ASCII, which in its internationally standardized form is also known as the International Reference Version (IRV) of:
- ISO/IEC 646:1991, Information technology – ISO 7-bit single-byte coded character set for information interchange.
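The correspondence between row 00 and the 8-bit code can be demonstrated with Python's latin-1 codec, which is used here as a stand-in for ISO/IEC 8859-1. Note that this codec also maps the control positions, which strictly belong to separate control function sets; that is an artefact of the sketch.

```python
# Every octet of an 8-bit string maps to the cell of row 00 with the same value.
octets = bytes(range(256))
decoded = octets.decode("latin-1")
same_value = all(ord(ch) == octet for ch, octet in zip(decoded, octets))
print(same_value)   # True

# Positions of row 00 left for graphic characters other than SPACE.
controls = 32 + 32                  # 0000-001F and 0080-009F
graphic = 256 - controls - 1 - 1    # minus DELETE (007F) and SPACE (0020)
print(graphic)                      # 190
```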
To aid its interpretation and development, the Basic Multilingual Plane is divided into five zones corresponding to the following code positions:
- A-zone: code positions 0000-4DFF but excluding the positions 0000-001F and 0080-009F reserved for control characters and 007F reserved for the DELETE character (leaving 19903 positions)
- I-zone: code positions 4E00-9FFF (20992 positions)
- O-zone: code positions A000-D7FF (14336 positions)
- S-zone: code positions D800-DFFF (2048 positions)
- R-zone: code positions E000-FFFD (8190 positions)
The R-zone terminates at FFFD as positions FFFE and FFFF are reserved; see the section of this guide on the 4-octet code structure of the UCS.
Each zone has a distinctive use:
- the A-zone is used for alphabetic and syllabic scripts together with various symbols;
- the I-zone is used for Chinese/Japanese/Korean (CJK) unified ideographs;
- the O-zone is used for the Korean Hangul syllabic script, and for various other scripts;
- the S-zone is reserved for use with transformation format UTF-16;
- the R-zone is known as the restricted use zone and contains sets of graphic characters for various uses including private use that is outside the scope of standardization.
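The zone boundaries listed above can be captured in a small lookup table. The sketch below is illustrative; zone_of is a hypothetical helper, and for simplicity it reports the control positions and DELETE as lying within the range of the A-zone, although they are excluded from its count.

```python
ZONES = [
    ("A", 0x0000, 0x4DFF),
    ("I", 0x4E00, 0x9FFF),
    ("O", 0xA000, 0xD7FF),
    ("S", 0xD800, 0xDFFF),
    ("R", 0xE000, 0xFFFD),
]

def zone_of(code: int) -> str:
    """Return the zone letter for a BMP code position."""
    for name, first, last in ZONES:
        if first <= code <= last:
            return name
    raise ValueError(f"{code:04X} is not assigned to any zone")

# Number of positions in each zone.
sizes = {name: last - first + 1 for name, first, last in ZONES}
sizes["A"] -= 32 + 32 + 1   # control positions and DELETE are excluded

print(sizes)
print(zone_of(0x0041), zone_of(0x4E00), zone_of(0xD800))   # A I S
```

The computed sizes agree with the counts given in the list of zones.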
The transformation format UTF-16 was introduced by Amendment 1 to the first edition of ISO/IEC 10646-1, which also created the S-zone by a splitting of the O-zone. Prior to that amendment the O-zone extended to code position DFFF. UTF-16 extends the two-octet coding of the BMP into a variable-length coding. In that coding the characters of all zones of the BMP (P=00) other than the S-zone are encoded in two octets while in addition characters of any of the sixteen planes P=01 to P=10 (remember that 10 here is a hexadecimal value) are encoded in four octets.
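The way in which UTF-16 uses the S-zone can be sketched as follows. The calculation shown is the commonly published UTF-16 algorithm and is not quoted from this section; the function name utf16_pair is hypothetical. The result is cross-checked against Python's built-in utf-16-be codec.

```python
from typing import Tuple

def utf16_pair(code: int) -> Tuple[int, int]:
    """Return the two S-zone values that represent a code in planes 01 to 10."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only planes 01 to 10 are encoded as a pair")
    offset = code - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits, range D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits, range DC00-DFFF
    return high, low

high, low = utf16_pair(0x10000)
print(f"{high:04X} {low:04X}")   # D800 DC00

# Cross-check against the built-in codec for the last position of plane 10.
print(chr(0x10FFFF).encode("utf-16-be").hex(" "))   # db ff df ff
```

Each half of the pair lies within the S-zone, so a four-octet sequence can never be mistaken for two ordinary BMP characters.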
4.3 Alphabetic and syllabic scripts of the A-zone
The A-zone is structured into named blocks, each consisting of a consecutive range of cells. Each block is allocated to a related set of characters, although a block may contain individual cells that are currently unallocated. The characters in the UCS from a particular script may be grouped together in a single block (such as BENGALI) or they may be divided among several blocks (such as BASIC ARABIC and ARABIC EXTENDED). The characters of the Latin script occupy the first four named blocks BASIC LATIN, LATIN-1-SUPPLEMENT, LATIN EXTENDED-A, LATIN EXTENDED-B but in addition there is one further block of Latin characters, LATIN EXTENDED ADDITIONAL, which occurs further into the code table.
Separate from the block structure, but closely related to it, is the concept of a collection of characters. A collection is the subset of characters allocated to a specified range of cells. The difference between a block and a collection is that the cells of a collection need not be consecutive and two collections may overlap. Collections are assigned both a name and a number. Blocks divide the code space into separate areas that are allocated for a coherent purpose. Collections put blocks and/or individual characters together to form subsets of practical significance. A user may then put several collections together to form a subset meeting a particular need, such as communication in English and Hebrew.
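The distinction between blocks, collections and user-defined subsets can be sketched with a few entries taken from Table 1. The data structures and the function in_subset are hypothetical, and the sketch treats every cell of a block as a member, ignoring cells that are currently unallocated.

```python
# Blocks: name -> inclusive range of consecutive cells (values from Table 1).
BLOCKS = {
    "BASIC LATIN": (0x0020, 0x007E),
    "HEBREW EXTENDED-A": (0x0590, 0x05CF),
    "BASIC HEBREW": (0x05D0, 0x05EA),
    "HEBREW EXTENDED-B": (0x05EB, 0x05FF),
}

# Collections: number -> (name, blocks that make up the collection).
COLLECTIONS = {
    1: ("BASIC LATIN", ["BASIC LATIN"]),
    12: ("BASIC HEBREW", ["BASIC HEBREW"]),
    13: ("HEBREW EXTENDED", ["HEBREW EXTENDED-A", "HEBREW EXTENDED-B"]),
}

def in_subset(code: int, collection_numbers) -> bool:
    """True if the code position belongs to any of the given collections."""
    for number in collection_numbers:
        _, block_names = COLLECTIONS[number]
        for block in block_names:
            first, last = BLOCKS[block]
            if first <= code <= last:
                return True
    return False

# A user-defined subset formed from several collections.
english_hebrew = [1, 12, 13]
print(in_subset(ord("A"), english_hebrew))   # True
print(in_subset(0x05D0, english_hebrew))     # True
print(in_subset(0x0410, english_hebrew))     # False: a cell of the CYRILLIC block
```

Collection 13 shows why a collection need not be a consecutive range: its cells lie on either side of the BASIC HEBREW block.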
Table 1 shows the blocks and collections of the first nine rows of the A-zone, comprising cells 0000-08FF. It gives both the name of each block and the range of cells that comprise it. With the exception of the collection HEBREW EXTENDED, which is formed from two blocks, there is a one-to-one correspondence between blocks and collections for the characters in these nine rows. The table also gives the number assigned to the collection in the first column; the collection name is the same as that of the block.
Table 1 : Blocks and Collections of rows 00-08 of the UCS
(collection = block, except for collection 13; *, = contains combining characters; see the section below on combining characters for the significance of these markings)
Collection | Name | Cells
1 | BASIC LATIN | 0020-007E
2 | LATIN-1-SUPPLEMENT | 00A0-00FF
3 | LATIN EXTENDED-A | 0100-017F
4 | LATIN EXTENDED-B | 0180-024F
5 | IPA EXTENSIONS | 0250-02AF
6 | SPACING MODIFIER LETTERS | 02B0-02FF
7 | COMBINING DIACRITICAL MARKS | 0300-036F
8 | BASIC GREEK | 0370-03CF
9 | GREEK SYMBOLS AND COPTIC | 03D0-03FF
10 | CYRILLIC | 0400-04FF
- | (Reserved for future standardization) | 0500-052F
11 | ARMENIAN | 0530-058F
- | HEBREW EXTENDED-A (31 further Hebrew characters have been allocated to previously reserved cells in this block by Amd. 7) | 0590-05CF
12 | BASIC HEBREW | 05D0-05EA
- | HEBREW EXTENDED-B | 05EB-05FF
13* | HEBREW EXTENDED (This collection comprises the two blocks HEBREW EXTENDED-A and HEBREW EXTENDED-B) | -
14* | BASIC ARABIC | 0600-065F
15* | ARABIC EXTENDED | 0660-06FF
85 | SYRIAC (added by Amd.27, hence the out-of-sequence number) | 0700-074F
- | (Reserved for future standardization) | 0750-077F
86* | THAANA (added by Amd.24, hence the out-of-sequence number) | 0780-07BF
- | (Reserved for future standardization) | 07C0-08FF
Certain characters in the blocks LATIN-1-SUPPLEMENT and LATIN EXTENDED-B have had their names changed by Technical Corrigendum 1 (1996) since the publication of the first edition of the standard in 1993. In the first of these blocks the characters affected are:
- LATIN CAPITAL LIGATURE AE, renamed to LATIN CAPITAL LETTER AE (ash);
- LATIN SMALL LIGATURE AE, renamed to LATIN SMALL LETTER AE (ash).
In the other block the affected characters are these same characters with added diacritical marks MACRON or ACUTE. The same name changes will be made in the next editions of the parts of ISO/IEC 8859 in which these characters appear.
The next five rows, 09-0D, are allocated to scripts that require the two special characters
- ZERO WIDTH NON-JOINER (code position 200C)
- ZERO WIDTH JOINER (code position 200D)
in the coding of languages written in those scripts. As with rows 00-08, there is a collection corresponding to each block, but for these rows the collection consists of the characters allocated to that block together with these two special characters.
Table 2 shows the blocks and collections of rows 09-0D of the A-zone, comprising cells 0900-0DFF. It gives both the name and the range of cells that comprise the block. The table also gives the number assigned to the collection that consists of the characters allocated to the block together with the additional characters at positions 200C and 200D. The collection name is the same as that of the block on which it is based.
Table 2 : Blocks and Collections of Rows 09-0D of the UCS
(collection = block + 200C + 200D; * = contains combining characters)
Collection | Name | Cells
16* | DEVANAGARI | 0900-097F
17* | BENGALI | 0980-09FF
18* | GURMUKHI | 0A00-0A7F
19* | GUJARATI | 0A80-0AFF
20* | ORIYA | 0B00-0B7F
21* | TAMIL | 0B80-0BFF
22* | TELUGU | 0C00-0C7F
23* | KANNADA | 0C80-0CFF
24* | MALAYALAM | 0D00-0D7F
84* | SINHALA (added by Amd.21, hence the out-of-sequence number) | 0D80-0DFF
The remainder of the first 32 rows, namely rows 0E-1F, are either reserved or allocated to further scripts that correspond to collections on a one-to-one basis without additional characters. These are shown in Table 3.
Table 3 : Blocks and Collections of Rows 0E-1F
(collection = block; * = contains combining characters)
Collection | Name | Cells
25* | THAI | 0E00-0E7F
26* | LAO | 0E80-0EFF
72* | BASIC TIBETAN (added by Amd.6, hence the out-of-sequence number) | 0F00-0FBF
- | (Reserved for future standardization) | 0FC0-109F
28 | GEORGIAN EXTENDED (note that the collection number is out of sequence) | 10A0-10CF
27 | BASIC GEORGIAN | 10D0-10FF
29 | HANGUL JAMO | 1100-11FF
73 | ETHIOPIC (added by Amd.10, hence the out-of-sequence number) | 1200-137F
- | (Reserved for future standardization) | 1380-139F
75 | CHEROKEE (added by Amd.12, hence the out-of-sequence number) | 13A0-13FF
74 | UNIFIED CANADIAN ABORIGINAL SYLLABICS (added by Amd.11, hence the out-of-sequence number) | 1400-167F
82 | OGHAM (added by Amd.20, hence the out-of-sequence number) | 1680-169F
83 | RUNIC (added by Amd.19, hence the out-of-sequence number) | 16A0-16FF
87* | BURMESE (added by Amd.26, hence the out-of-sequence number) | 1700-177F
88* | KHMER (added by Amd.25, hence the out-of-sequence number) | 1780-17FF
- | (Reserved for future standardization) | 1800-1DFF
30 | LATIN EXTENDED ADDITIONAL (one additional Latin character has been allocated to a previously reserved cell in this block by Amd.7.) | 1E00-1EFF
31 | GREEK EXTENDED | 1F00-1FFF
The next nine rows of the A-zone contain symbols of various sorts and for various scripts, including technical and special purpose symbols. These take up rows 20-28 and they are followed by a further seven rows that are at present unallocated. This area of the A-zone is structured as in Table 4.
Table 4 : Blocks and Collections of Rows 20-2F
(collection = block; = contains combining characters)
Collection | Name | Cells
32 | GENERAL PUNCTUATION | 2000-206F
33 | SUPERSCRIPTS AND SUBSCRIPTS | 2070-209F
34 | CURRENCY SYMBOLS | 20A0-20CF
35 | COMBINING DIACRITICAL MARKS FOR SYMBOLS | 20D0-20FF
36 | LETTERLIKE SYMBOLS | 2100-214F
37 | NUMBER FORMS | 2150-218F
38 | ARROWS | 2190-21FF
39 | MATHEMATICAL OPERATORS | 2200-22FF
40 | MISCELLANEOUS TECHNICAL | 2300-23FF
41 | CONTROL PICTURES | 2400-243F
42 | OPTICAL CHARACTER RECOGNITION | 2440-245F
43 | ENCLOSED ALPHANUMERICS | 2460-24FF
44 | BOX DRAWING | 2500-257F
45 | BLOCK ELEMENTS | 2580-259F
46 | GEOMETRIC SHAPES | 25A0-25FF
47 | MISCELLANEOUS SYMBOLS | 2600-26FF
48 | DINGBATS | 2700-27BF
- | (Reserved for future standardization) | 27C0-27FF
80 | BRAILLE PATTERNS (added by Amd.16) | 2800-28FF
- | (7 more rows reserved for future standardization) | 2900-2FFF
The next 30 rows contain alphabetic scripts and symbols that are used by languages that also make use of ideographic scripts. The reference to CJK in the titles of some of the blocks of these rows is to unified Chinese/Japanese/Korean characters; see the section on ideographic scripts for more information. The blocks and collections of these rows are given in Table 5.
Table 5 : Blocks and Collections of Rows 30-4D
(collection = block; * = contains combining characters)
Collection | Name | Cells
49* | CJK SYMBOLS AND PUNCTUATION | 3000-303F
50* | HIRAGANA | 3040-309F
51 | KATAKANA | 30A0-30FF
52 | BOPOMOFO | 3100-312F
53 | HANGUL COMPATIBILITY JAMO | 3130-318F
54 | CJK MISCELLANEOUS | 3190-319F
55 | ENCLOSED CJK LETTERS AND MONTHS | 3200-32FF
56 | CJK COMPATIBILITY | 3300-33FF
81 | CJK UNIFIED IDEOGRAPHS EXTENSION A (Amd.17) | 3400-4DBF
- | (Reserved for future standardization) | 4DC0-4DFF
The CJK COMPATIBILITY block includes many symbols for scientific units that have been coded in Chinese national standards as if they were ideographs. Examples, together with their coding, are
- mm³ (cubic millimetres): SQUARE MM CUBED (coded at 33A3)
- µs (microsecond): SQUARE MU S (coded at 33B2)
- rad/s² (radians per second per second, a unit of angular acceleration): SQUARE RAD OVER S SQUARED (coded at 33AF)
The last 26 rows of the A-zone, rows 34-4D, now contain CJK Unified Ideographs Extension A (Amendment 17). However, these rows were allocated in the first edition of ISO/IEC 10646-1 to the Hangul syllabic script, divided into three blocks and corresponding collections numbered 57-59. Amendment 5 to this first edition deleted these allocations and created instead an allocation for a substantially larger set of Hangul syllabic characters in the O-zone. This was accepted as a violation of the principle that published allocations would not be changed, but there were compelling reasons to adopt this change. It will not be taken as a precedent for future changes of a similar nature.
4.4 Unified ideographs of the I-zone
The I-zone of the BMP is allocated as a single block to Chinese/Japanese/Korean unified ideographs, and it correspondingly forms a single collection. For completeness this is shown in the following table:
Table 6 : The one Block and Collection of the I-zone
Collection | Name | Cells
60 | CJK UNIFIED IDEOGRAPHS | 4E00-9FFF
An informative annex S has been added to ISO/IEC 10646-1 by Amendment 8 which describes the unification procedure. This section of the guide is based on that annex.
The I-zone contains 20992 code positions, of which 20902 are currently allocated to specific ideographs. These ideographs were derived from over 54000 ideographs which are found in various different national and regional standards for coded character sets. A process of unification was applied in which single ideographs from two or more of the source standards were associated together and assigned to a single code position in the I-zone. The ideographs that are thus associated are described, for the purposes of the UCS, as unified. To preserve data integrity, any ideographs that are separately encoded in any one of the source standards were not unified. Also ideographs that are unrelated in historical derivation are not unified. However, some ideographs encoded in two different standards for the same language may have been unified.
The unification process is based on the shapes of the ideographs, analyzed according to a systematic procedure. Any ideograph is composed of geometric elements which may themselves be composite structures and possibly ideographs in their own right. This enables the structure of an ideograph to be described by a component tree, where the top node is the ideograph itself and the bottom nodes are primitive elements. When two ideographs are compared, their component trees are compared to see if they agree in all of the following aspects:
- the number of components;
- the relative position of the components in each complete ideograph;
- the structure of the corresponding components.
If all of these aspects agree then the ideographs are considered to have the same abstract shape and are therefore unified. Annex S to ISO/IEC 10646-1 contains a listing of pairs or triples of ideographs that would have been unified under these rules except for the criteria concerning historical derivation or separate encoding in an existing standard.
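The three-way comparison of component trees can be illustrated with a small recursive function. Everything in this sketch is an assumption made for illustration: the `Component` class, its `shape` and `position` labels, and the rule that corresponding components are paired in the order listed are not taken from Annex S. The further criteria of historical derivation and separate encoding in a source standard are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A node of a component tree: a primitive element or a composite structure."""
    shape: str               # hypothetical label; only compared for primitive elements
    position: str = "whole"  # hypothetical label for its place within the parent
    parts: tuple = ()        # child components; empty for a primitive element


def same_abstract_shape(a: Component, b: Component) -> bool:
    """Compare two component trees on the three aspects listed above."""
    # Aspect 1: the number of components.
    if len(a.parts) != len(b.parts):
        return False
    # Bottom nodes are primitive elements and are compared directly.
    if not a.parts:
        return a.shape == b.shape
    # Aspects 2 and 3: relative position and structure of corresponding components.
    return all(
        pa.position == pb.position and same_abstract_shape(pa, pb)
        for pa, pb in zip(a.parts, b.parts)
    )


side_by_side = Component("ideograph", parts=(
    Component("element-1", "left"), Component("element-2", "right")))
stacked = Component("ideograph", parts=(
    Component("element-1", "top"), Component("element-2", "bottom")))
print(same_abstract_shape(side_by_side, side_by_side))  # same tree compared with itself
print(same_abstract_shape(side_by_side, stacked))       # same elements, different positions
```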
Unified ideographs are named and listed in the code pages of ISO/IEC 10646-1 in a manner separate from that used for other scripts. For each unified ideograph, the listing reproduces all (which may only be one) of the graphic symbols (source ideographs) that have been unified into that code position. For each graphic symbol it specifies the source standard from which the graphic symbol is taken and the coded representation of the symbol in that standard. The name assigned to each unified ideograph is generated algorithmically by appending its two-octet coded representation, written as four hexadecimal digits, to “CJK UNIFIED IDEOGRAPH-”, for example CJK UNIFIED IDEOGRAPH-4E00.
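The algorithmic naming rule, and the size of the I-zone quoted earlier in this section, can be checked with a short Python sketch. The function name `cjk_unified_name` is hypothetical. The sketch accepts any I-zone code position, including the cells that are not currently allocated.

```python
I_ZONE = range(0x4E00, 0xA000)  # code positions 4E00-9FFF


def cjk_unified_name(code_point: int) -> str:
    """Build the algorithmic name for a code position in the I-zone."""
    if code_point not in I_ZONE:
        raise ValueError("not an I-zone code position")
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"


print(len(I_ZONE))               # 20992 code positions
print(cjk_unified_name(0x4E00))  # CJK UNIFIED IDEOGRAPH-4E00
```

Python's standard `unicodedata.name` function returns the same string for an allocated ideograph, which gives a convenient cross-check.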
The information concerning CJK unified ideographs has now been replaced by Amd.13.
4.5 The Hangul syllabics of the O-zone and Yi
Amendment 5 to the first edition of ISO/IEC 10646-1 specified a change in the encoding of Hangul syllabic script. Prior to that Amendment, the last 26 rows of the A-zone (row numbers 34-4D) were allocated to the Hangul syllabic script and the entire O-zone was reserved for future standardization. Due to a major revision of the corresponding Korean national standard shortly after the final text of the first edition was agreed, it became necessary to accommodate substantially more syllabic characters into the UCS. To include these additional characters, the total space required would be almost 44 rows.
It was decided that this circumstance was sufficiently exceptional to merit violating the principle that code positions, once allocated, should not be changed. The Hangul syllabic characters already encoded would be moved from the A-zone to the O-zone, where there was sufficient space to include both the original and the additional characters in a single block, with a corresponding single collection. The amendment contains the statement that this change is not intended to be regarded as a precedent for other changes of allocation in future editions. This statement will itself be incorporated into future editions.
Amendment 14 has added the syllables and radicals of the Yi script to the O-Zone.
Following these amendments, the O-zone has the structure shown in the following table:
Table 7 : The Blocks and Collections of the O-zone
Collection | Name | Cells
76 | YI SYLLABLES | A000-A48F
77 | YI RADICALS | A490-A4CF
- | (Reserved for future standardization) | A4D0-ABFF
71 | HANGUL EXTENDED | AC00-D7A3
- | (Reserved for future standardization) | D7A4-D7FF
Amendment 5 contains a mapping table giving the correspondence between the code positions before and after this amendment for the characters originally allocated to rows 34-4D.
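The figure of almost 44 rows quoted earlier in this section follows directly from the HANGUL EXTENDED range in Table 7, given that a row holds 256 cells. A minimal check of the arithmetic (the variable names are illustrative):

```python
CELLS_PER_ROW = 256

first, last = 0xAC00, 0xD7A3               # HANGUL EXTENDED block, from Table 7
syllables = last - first + 1               # cells occupied by the block
rows_needed = syllables / CELLS_PER_ROW    # space required, in rows
original_rows = 0x4D - 0x34 + 1            # rows 34-4D of the first edition

print(syllables)      # 11172
print(rows_needed)    # 43.640625
print(original_rows)  # 26
```

The 26 rows originally allocated in the A-zone were therefore well short of what the revised repertoire required.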
The Hangul syllabic characters are assigned names that follow the naming rules used for alphabetic scripts, e.g. HANGUL SYLLABLE GEOLH (KEOLH), rather than the algorithmic name structure used for the CJK unified ideographs of the I-zone.
4.6 The restricted use R-zone
The R-zone is distinguished from the remainder of the BMP in that its code positions are allocated for use only in special circumstances. There are three distinct uses for the R-zone:
- Private use characters
- These may be specific user-defined characters or may be dynamically-redefinable characters. In either case an agreement is necessary between sender and recipient, outside the scope of ISO/IEC 10646, if these are to be exchanged meaningfully between two communicating parties.
- Presentation forms of characters
- A presentation form is an alternative form, for use in a particular context, to the nominal form of a character or sequence of characters from the other zones of graphic characters. The transformation from the nominal form to the presentation forms may involve substitution, superimposition or combination. The rules for such transformations are outside the scope of ISO/IEC 10646. Presentation forms are not normally intended to be used as a substitute for the nominal forms, but specific applications may use them in this way for particular purposes such as compatibility with existing devices. The specification of presentation forms within ISO/IEC 10646, an example of which is LATIN SMALL LIGATURE FI at code position FB01, blurs the distinction between characters and glyphs discussed elsewhere in this annex.
- Compatibility characters
- Compatibility characters are included in the UCS primarily for compatibility with existing coded character sets to allow two-way code conversion without loss of information.
As with the other zones, the R-zone is divided into blocks and collections, but the block for private use consists, by its very nature, only of unallocated code positions. The structure of this zone is as follows:
Table 8 : The Blocks and Collections of the R-zone
(collection = block; *, = contains combining characters)
Collection | Name | Cells
61 | PRIVATE USE AREA | E000-F8FF
62 | CJK COMPATIBILITY IDEOGRAPHS | F900-FAFF
63* | ALPHABETIC PRESENTATION FORMS | FB00-FB4F
64 | ARABIC PRESENTATION FORMS-A | FB50-FDFF
- | (Reserved for future standardization) | FE00-FE1F
65 | COMBINING HALF MARKS | FE20-FE2F
66 | CJK COMPATIBILITY FORMS | FE30-FE4F
67 | SMALL FORM VARIANTS | FE50-FE6F
68 | ARABIC PRESENTATION FORMS-B | FE70-FEFE
- | (The single character at code position FEFF is not in any of the blocks into which the BMP is divided. Its significance is explained in the chapter of this guide on Serial Transmission of the UCS) | FEFF
69 | HALFWIDTH AND FULLWIDTH FORMS | FF00-FFEF
70 | SPECIALS | FFF0-FFFD
Recall that the final two positions FFFE, FFFF are required to be left unused in every plane of the UCS. The collection numbered 200 is one of a number of special-purpose collections that have been assigned numbers in the range 200-299. See section 7 of this guide on repertoires and subsets for more information.
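Because the zones divide the BMP into consecutive ranges, the zone of any code position can be found with a binary search over the zone boundaries. The sketch below is illustrative: the function name `bmp_zone` is hypothetical, the boundaries are read off Tables 1 to 8 of this section, and the start of the S-zone at D800 is inferred from the end of the O-zone at D7FF in Table 7.

```python
from bisect import bisect_right

# First code position of each zone, in ascending order.
ZONE_STARTS = [0x0000, 0x4E00, 0xA000, 0xD800, 0xE000, 0xFFFE]
ZONE_NAMES = ["A-zone", "I-zone", "O-zone", "S-zone", "R-zone", "unused"]


def bmp_zone(code_point: int) -> str:
    """Return the name of the BMP zone that contains a code position."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("not a BMP code position")
    # bisect_right counts the zone starts that are <= code_point.
    return ZONE_NAMES[bisect_right(ZONE_STARTS, code_point) - 1]


for cp in (0x0041, 0x4E00, 0xAC00, 0xD800, 0xFB01, 0xFFFE):
    print(f"{cp:04X}: {bmp_zone(cp)}")
```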
5 Visual representation of characters
5.1 Combining and non-combining characters
The concept of a combining character was discussed above in the section of this guide on precomposed and decomposed characters. This section describes their use in more detail.
The definition given in ISO/IEC 10646-1 is:
- combining character: A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining character, or with a sequence of combining characters preceded by a non-combining character.
The fact that combining characters are coded following the base non-combining character separates them in principle from the non-spacing diacritical marks used for coding purposes in the variable-length code of ISO/IEC 6937. The nature of a non-spacing symbol, based on the capabilities of electromechanical printing devices, requires that it is received and printed by such a device before the base character with which it is combined. Combining characters in the UCS behave in a more intuitive way; it is more natural to apply diacritical marks to a character that is already known or written than to one that is to be notified later on.
Combining characters are not an essential part of the coding of the Latin script. They are modifiers of letters and all combinations of base letter and diacritical mark in normal use in languages written in the Latin script are separately coded as precomposed characters. Some examples which illustrate the level to which the coding of precomposed Latin characters has been taken are:
- 0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
- 0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE
- 01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
- 1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
However, diacritical marks are also used in the Latin script for reasons other than as part of normal spelling, for example to indicate stress positions in pronunciation. This may require combinations of letter and mark that are not used in normal language. The entire range of diacritical marks used with Latin (and Greek) scripts is available in collection 7, COMBINING DIACRITICAL MARKS.
A similar situation holds for the Greek script, both for its monotonic and polytonic forms. However, many other scripts such as Arabic and the Indic scripts (Devanagari, Gujarati, etc.) are written in such a manner that combining characters are an essential part of coding. The Indic scripts, for example, represent vowels by combining marks. The Cyrillic script falls between these two situations; whether combining characters are or are not essential depends on the language being represented.
The use of combining characters can add significantly to the difficulties of processing encoded text. For this reason, ISO/IEC 10646-1 defines three distinct levels of implementation in which either none (level 1), or some (level 2), or all (level 3) of the combining characters of the UCS are permitted to be encoded. These levels are described in more detail elsewhere in this annex. At any specified level of implementation, the defined character collections of the UCS have to be interpreted as if any combining characters whose use is not permitted at that level have been removed.
These differences between scripts lead to three distinct situations regarding the character collections of the UCS:
- Collections not containing combining characters at any level of implementation
- These are the collections not marked with either * or in the tables given above. They include all the collections of graphic characters for the Latin and Greek scripts and also the Basic Hebrew collection. Text requiring only these collections can be encoded at implementation level 1.
- Collections containing combining characters when used at level 3 but not at levels 1 or 2
- These are the collections marked with in the tables given above. At present there are four collections of this nature, but three of them (those containing the word COMBINING in their name) are empty at levels 1 and 2. The only collection of this nature available for use at levels 1 and 2 is the CYRILLIC collection. Much text written in the Cyrillic script does not make use of the combining characters and so can be encoded at implementation level 1. When they are required, it is necessary to use level 3.
- Collections containing combining characters when used at either of levels 2 and 3
- These are the collections marked with * in the tables given above. They include the HEBREW EXTENDED collection and all the collections for the Indic scripts. Much text requiring these collections can be encoded at level 2 but use of level 1 is not normally practicable.
It is the general intention of the UCS that for most purposes it will not be necessary to use a level 3 implementation but that the choice between levels 1 and 2 will depend on the character collections to be used.
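A rough check of whether a piece of text could be encoded at implementation level 1 is to look for any combining character in it. The sketch below is an approximation under a stated assumption: it treats every character whose Unicode general category begins with M (as reported by Python's `unicodedata` module) as a combining character, which need not coincide exactly with the identified subset defined in ISO/IEC 10646-1. It cannot distinguish level 2 from level 3, since that depends on the standard's own list. The function names are hypothetical.

```python
import unicodedata


def combining_positions(text: str) -> list[int]:
    """Return the indexes of characters treated here as combining characters."""
    return [
        index
        for index, ch in enumerate(text)
        if unicodedata.category(ch).startswith("M")  # Mn, Mc or Me
    ]


def usable_at_level_1(text: str) -> bool:
    """Level 1 permits no combining characters at all."""
    return not combining_positions(text)


print(usable_at_level_1("caf\u00e9"))      # precomposed E WITH ACUTE
print(usable_at_level_1("cafe\u0301"))     # E followed by COMBINING ACUTE ACCENT
print(combining_positions("\u0915\u093f"))  # Devanagari KA with a vowel sign
```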
The definitions clause of ISO/IEC 10646-1 includes the following terms:
- composite sequence: A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters.
- graphic symbol: The visual representation of a graphic character or of a composite sequence.
- repertoire: A specified set of characters that are represented in a coded character set.
It also contains the following notes:
- A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence.
- A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646.
The term glyph is not used in ISO/IEC 10646 but its use enables us to divide the formation of a graphic symbol into two stages:
- The non-combining character or composite sequence is mapped to a glyph, which is an abstract graphic symbol;
- An image of the glyph is then formed according to some presentation process, to form a real graphic symbol.
This division separates all aspects such as the font to be used, its size and emphasis, etc. from the selection of the appropriate glyph. The process of creating the image of the appropriate glyph will in general be very complex, but it is entirely separate from the process of selecting the appropriate glyph to represent the character data. The mapping from the character data (single character or composite sequence) to the glyph is from one abstract entity to another, devoid of all complications arising from the actual presentation process.
The definitions quoted above give rise to the following features of the mapping from character data to glyphs:
- A single non-combining character (which is in the repertoire of the UCS) and a composite sequence (which is not in the repertoire) may map to the same glyph, e.g. the character LATIN SMALL LETTER E WITH ACUTE and the composite sequence LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
- A composite sequence may map to a glyph which represents a character that is not present in the UCS, e.g. LATIN SMALL LETTER G followed by COMBINING GRAVE ACCENT; there is no character LATIN SMALL LETTER G WITH GRAVE in the UCS, although LATIN SMALL LETTER G WITH ACUTE is present. Characters which can be represented in this way by the UCS, but which are not in the UCS, are not considered to be part of the repertoire of the UCS.
- A given piece of text may have many equally valid representations by a string of characters, depending on whether precomposed or decomposed characters are used.
- As there are no restrictions on the use of combining characters (in the levels of implementation at which they are permitted at all), many composite sequences will not map to any meaningful glyph. This applies in particular to composite sequences in which non-combining characters from one script are followed by combining characters from another script. This is not forbidden but it is unlikely to be meaningful.
These features make the UCS very different from ISO/IEC 6937, despite the similarity at first sight between the combining characters of the UCS and the non-spacing diacritical marks of ISO/IEC 6937. This latter standard has the following features which may be compared with the list given above for the UCS:
- The letters which can be formed from a base letter preceded by a non-spacing diacritical mark are part of the repertoire of ISO/IEC 6937.
- A given piece of text has a unique coding in terms of ISO/IEC 6937 since accented characters are available only in decomposed form (i.e. coded with the aid of non-spacing diacritical marks).
- There is a normative list of the characters comprising the repertoire of the standard, so constructions that have no meaningful glyph, such as NON-SPACING GRAVE ACCENT followed by POUND SIGN, are not conformant to ISO/IEC 6937. Unfortunately this has resulted in certain combinations of grave and acute accents, and of diaeresis, with the letters W and Y, used in the Welsh language, being absent from the repertoire of that standard even though they have natural representations in terms of the available non-spacing diacritical marks.
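Returning to the UCS features listed earlier in this section, the existence of several valid codings for the same text can be demonstrated with Python's `unicodedata` module. One caveat: the normalization forms used here (NFC) are defined by the Unicode Standard and are not part of ISO/IEC 10646-1 as described in this guide, so this is an illustration of the consequences rather than of the standard's own mechanism.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
composite = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Two different codings that map to the same glyph.
print(precomposed == composite)                                # False
print(unicodedata.normalize("NFC", composite) == precomposed)  # True

# G followed by COMBINING GRAVE ACCENT has no precomposed equivalent,
# so it remains a two-character composite sequence after composition.
g_grave = "g\u0300"
print(len(unicodedata.normalize("NFC", g_grave)))              # 2
```

An application that compares or searches UCS text therefore has to decide whether to treat such codings as equal, which is one of the processing difficulties that the levels of implementation are meant to contain.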
5.3 Use of multiple combining characters
The UCS permits more than one combining character to be applied to a single non-combining character. When this occurs, it lays down rules for their interpretation. In outline, these are as follows:
- When two combining characters can interact, then by default they are to be positioned in the sequence in which they are received, working from the base character outward. For example, if a combining macron is followed by a combining diaeresis then the diaeresis would appear above the macron, but if the order were reversed then the macron would be above the diaeresis.
- Some specific combining characters override the default behaviour by being positioned horizontally, rather than vertically, in relation to one another or by forming a ligature with an adjacent combining character. In this case they are positioned in the sequence in which they are received, working in the same direction (left-to-right or right-to-left) as the script is written.
- When two combining characters do not interact, such as when one is positioned above the base character and the other below it, then the order in which they are coded does not affect the resulting glyph.
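The difference between interacting and non-interacting combining characters can be sketched with the canonical combining classes that Python's `unicodedata` module exposes. This is an approximation under stated assumptions: combining classes and canonical reordering (NFD) come from the Unicode Standard rather than from ISO/IEC 10646-1, and "same combining class" is used here as a stand-in for "can interact". The function names are hypothetical.

```python
import unicodedata


def may_interact(mark_a: str, mark_b: str) -> bool:
    """Approximate 'can interact' as 'same canonical combining class'."""
    return unicodedata.combining(mark_a) == unicodedata.combining(mark_b)


def same_after_reordering(seq_a: str, seq_b: str) -> bool:
    """Two codings are equivalent if canonical reordering makes them equal."""
    return unicodedata.normalize("NFD", seq_a) == unicodedata.normalize("NFD", seq_b)


MACRON, DIAERESIS = "\u0304", "\u0308"   # both positioned above the base
ACUTE, DOT_BELOW = "\u0301", "\u0323"    # one above, one below

# Marks that interact: the coded order is significant.
print(may_interact(MACRON, DIAERESIS))
print(same_after_reordering("a" + MACRON + DIAERESIS, "a" + DIAERESIS + MACRON))

# Marks that do not interact: either order gives the same result.
print(may_interact(ACUTE, DOT_BELOW))
print(same_after_reordering("a" + ACUTE + DOT_BELOW, "a" + DOT_BELOW + ACUTE))
```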
6 Referencing of characters
6.1 Identification of characters for migration to the UCS
Since a character is an abstract concept, the question of whether two characters are or are not “the same” is not a trivial one. If we recall the definition:
- character: A member of a set of elements used for the organisation, control, or representation of data (taken from ISO/IEC 10646-1:1993)
then there is no problem within a single coded character set, since any such set must clearly specify its members. But frequently, in the transmission or processing of character data, that data needs to be converted from one coded representation to another. This is a particular problem in the migration of existing applications from other character set codes to the UCS. The question then arises of identifying “the same character” in two different character sets.
One cannot just look at the characters, in a visual representation, since distinct characters may have the same glyph. The question should be whether they are both used in the same way in the organisation, control or representation of data. But the specification of a coded character set does not specify how its characters should be used; that is outside of its scope. It merely makes the characters available for use. The “sameness” of characters from different coded character sets is therefore ultimately a matter of convention or of definition.
One of the largest resources of coded character sets is the International Register of Coded Character Sets for use with Escape Sequences, maintained and published by the Registration Authority for ISO 2375 in accordance with the procedures of that standard. Those procedures specify how to compare two coded character sets, as follows. Two sets are deemed to be identical if
- the number of characters is the same;
- the names of the characters are the same according to the terminology of the Registration Authority;
- the same positions are used for the same characters;
- both sets are of the same type, in particular both a C0 or a C1 set;
- the definitions of control characters are functionally equivalent (a more restricted definition is not considered equivalent);
- graphic characters have the same geometric shape apart from aesthetic variations between fonts;
- any non-spacing graphic characters are in the same positions.
If we abstract from this those aspects which compare individual characters, rather than their code positions or overall aspects of the complete set, we see that two graphic characters are regarded as identical if
- they have the same name according to the terminology of the Registration Authority;
- they may be represented by the same glyph;
- for characters intended for combining with other characters, the rules for creating combinations are the same (in ISO 2375 the only recognized form of combination is the use of non-spacing characters).
The first of these requirements permits the Registration Authority to change the name (for example, from that used in the source standard whose code is being registered) to bring it into a standardized form. It is the policy of SC2, the ISO/IEC JTC1 sub-committee responsible for coded character set standards, to align the names of characters in its published standards with those used in ISO/IEC 10646. When necessary, renaming will take place when standards are next revised. Such renaming will ensure that two characters are given distinct names if they have distinct glyphs or distinct combining procedures. It follows that
- two graphic characters, from different coded character sets, should be regarded as the same if they have the same name according to the character naming guidelines of ISO/IEC 10646.
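Identification by name lends itself to a simple migration aid: given the standardized name of each character in a source code table, look up the UCS code position that carries the same name. The sketch below is illustrative only. It assumes that the character names known to Python's `unicodedata` module agree with those of ISO/IEC 10646, and the source table `SOURCE_SET` with its code positions is hypothetical, not taken from any real standard.

```python
import unicodedata
from typing import Optional


def ucs_position(name: str) -> Optional[int]:
    """Return the UCS code position of the character with this name, if any."""
    try:
        return ord(unicodedata.lookup(name))
    except KeyError:
        return None  # no character of that name: needs a human decision


# Hypothetical 8-bit source table: code position -> standardized character name.
SOURCE_SET = {
    0x41: "LATIN CAPITAL LETTER A",
    0xE6: "LATIN SMALL LETTER AE",
    0xD5: "MUSIC NOTE",
}


def migration_table(source: dict) -> dict:
    """Map each source code position to a UCS code position, or None."""
    return {position: ucs_position(name) for position, name in source.items()}


for position, target in migration_table(SOURCE_SET).items():
    shown = "no match by name" if target is None else f"{target:04X}"
    print(f"{position:02X} -> {shown}")
```

Entries that come back without a match are exactly the cases where, as the following section discusses, the glyphs and intended use have to be compared by hand.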
6.2 Naming guidelines of the UCS
The naming guidelines of the UCS are given in annex K of ISO/IEC 10646. They include the following:
- By convention, only Latin capital letters A to Z, space, and hyphen shall be used for writing the names of characters.
NOTE – Names of ideographic characters may also include digits 0 to 9 provided that a digit is not the first character in a word.
- In some cases, the name of a character can be followed by additional explanatory statements that are not part of the name. These statements shall be in parentheses and not in capital Latin letters except the initials of the word where required.
- The name of a character shall wherever possible denote its customary meaning, for example PLUS SIGN. Where this is not possible, names should describe shapes, not usage; for example: UPWARDS ARROW.
- The names shall be constructed from an appropriate set of the applicable terms of the following grid and ordered in the sequence of this grid:
1. Script (e.g. Latin, Cyrillic, Arabic – letters that are elements of more than one script are considered different even if their shape is the same. This is not necessarily true of non-letter characters such as punctuation marks, even where the usage differs between scripts. In such cases the name reflects the most customary use, with alternative names in parentheses)
2. Case (e.g. capital, small).
3. Type (e.g. letter, ligature, digit).
4. Language (e.g. Ukrainian – only included to remove ambiguity, such as between CYRILLIC CAPITAL LETTER I and CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I, which are distinguished by having different glyphs).
5. Attribute (e.g. final, sharp, subscript, vulgar).
6. Designation (e.g. customary name, name of letter – the names of letters of all scripts other than Latin are represented by their transcription in the language of the first published International Standard).
7. Mark(s) (e.g. acute, ogonek, ring above, diaeresis).
8. Qualifier (e.g. sign, symbol).
- There are exceptions to the above rules. Traditional names such as APOSTROPHE and COMMERCIAL AT shall be acceptable as names, and alphanumeric identifiers shall be used for ideographic characters.
Not all of the eight terms in this numbered list need be present. Examples of character names, with term numbers added after each name element, are
- LATIN (1) SMALL (2) LETTER (3) DOTLESS (5) J (6) WITH STROKE (7)
- DIGIT (3) FIVE (6)
- LEFT (5) CURLY (5) BRACKET (6)
- COMBINING (5) ACUTE (7) ACCENT (8)
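The first guideline above (permitted characters in a name) lends itself to a mechanical check. The following Python sketch is illustrative only: the function name `is_valid_character_name` and the `ideographic` flag are assumptions made for this example, and a "word" is taken to mean a run of characters between spaces, which the guidelines quoted here do not define precisely.

```python
import re

# A word may contain capital letters A-Z, hyphens and digits, but must not
# start with a digit (assumption: words are separated by single spaces).
_WORD = re.compile(r"[A-Z-][A-Z0-9-]*")


def is_valid_character_name(name: str, ideographic: bool = False) -> bool:
    """Check a name against the character-repertoire rule of the guidelines.

    Digits are accepted only when `ideographic` is True, following the NOTE
    that allows digits in the names of ideographic characters.
    """
    for word in name.split(" "):
        if not _WORD.fullmatch(word):
            return False
        if not ideographic and any(ch.isdigit() for ch in word):
            return False
    return True
```

Under these assumptions, `is_valid_character_name("UPWARDS ARROW")` is accepted, while a mixed-case spelling such as `"Upwards Arrow"` is rejected. The check covers only the permitted characters; it does not verify the ordering of name elements in the grid.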
These guidelines are sufficiently clear that there are very few cases in which it is unclear whether two characters from different coded character sets should have the same name under them. Here are some examples of naming problems.
- The changes in Technical Corrigendum 1 concerning Æ were said to be changes of name, for example from LATIN CAPITAL LIGATURE AE to LATIN CAPITAL LETTER AE.
Both names are constructed according to the naming guidelines and they differ in one name element, namely Type. They are therefore names for two different characters with the same glyph. The Technical Corrigendum changed the characters allocated to the six code positions affected; it did not rename six characters. This was, however, a correction of an editorial error and not a change of coding from the intentions of the first edition.
- The second edition (1994) of ISO/IEC 6937 contains a character MUSIC NOTE (with a musical quaver as the illustrative glyph). This is not present in the UCS, but that does contain the two characters QUARTER NOTE and EIGHTH NOTE, which have more specific names and illustrative glyphs of a musical crotchet and quaver respectively. Comparison of the glyphs in the two standards shows that MUSIC NOTE should be treated as the character EIGHTH NOTE, not as QUARTER NOTE.
- The first edition (1983) of ISO/IEC 6937, which preceded current naming guidelines, contained a character with the name “small letter g with acute accent”. In 8.3 of the second edition it states that this character has been renamed as LATIN SMALL LETTER G WITH CEDILLA in order to align with ISO/IEC 10367 (the cedilla being placed above the g for presentation purposes). However, the UCS contains both LATIN SMALL LETTER G WITH CEDILLA (in the collection LATIN EXTENDED-A) and LATIN SMALL LETTER G WITH ACUTE (in the collection LATIN EXTENDED-B). The justification for the name change is that the original name was in error; the character concerned was always intended to be the small letter corresponding to “capital letter g with cedilla” but was named erroneously due to the positioning of the diacritical mark.
6.3 Linguistic translation of character names
Because of the significance of the names of characters in constructing correspondences between the UCS and other coded character sets, it has been controversial within the relevant sub-committee ISO/IEC JTC1/SC2 as to whether the names of characters may be translated when the text of ISO/IEC 10646-1 is translated into another language. It has recently been agreed that the names of characters may be translated.
One effect of this decision is that names will no longer serve as language-independent unique identifiers of characters. They retain their central role in determining whether characters from different coded character sets are or are not the same, but the comparison of names must take place in a common language.
6.4 Unique identifiers for characters
If names of characters are to be translatable, there is a need for some other form of unique identifier for characters that is language independent. Since the aim of the UCS is to include all the world’s characters, the coding of a character in the UCS can be used as an identifier of that character in all situations, including in the specification of other coded character sets. Such a scheme would solve, for the future, the problem of comparing characters from different coded character sets. However, in order to add such identifiers to existing character sets as they are revised, it is first necessary to create a correspondence between the set concerned and the UCS by means of names as described above.
Amendment 9 to ISO/IEC 10646-1 proposes several alternative forms for unique identifiers constructed from UCS code positions. These have the following constructions, in which hhhhhhhh represents the eight hexadecimal digits that represent the code position in the UCS and kkkk represents the last four of these digits for characters of the Basic Multilingual Plane (BMP):
- hhhhhhhh or -hhhhhhhh or T-hhhhhhhh or U-hhhhhhhh;
- kkkk or +kkkk or T+kkkk or U+kkkk.
The significance of the optional prefixes is as follows:
- a minus sign indicates that the numeric form is the eight-digit form, a plus sign indicates that it is the four-digit form;
- a letter T indicates that the identifier refers to the character at the specified code position before the application of Amendment 5 to the first edition of ISO/IEC 10646-1, this amendment being the reallocation of Hangul syllabic characters from the A-zone to the O-zone;
- a letter U indicates that the identifier refers to the character at the specified code position after this application of Amendment 5.
If there is no prefix letter then the relevant amendment level is unspecified. The three forms (no prefix letter, T prefix, U prefix) coincide unless hhhhhhhh lies in the range 00003400 to 00004DFF inclusive. For this range, the correspondence between the T and U forms is given by the mapping table in the annex to Amd.5. As an example:
- T+340F and U+AC19 identify the same character.
The prefix letters, and the letters A to F used as hexadecimal letters, may be written either as capital letters or as small letters.
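The identifier forms described above can be recognized with a short parser. The following Python sketch is an illustration under stated assumptions: the names `parse_ucs_identifier` and `amendment_level_matters` are invented for this example, and the T-to-U mapping table of Amd.5 is not reproduced.

```python
import re

# Eight-digit forms: hhhhhhhh, -hhhhhhhh, T-hhhhhhhh, U-hhhhhhhh
# Four-digit forms:  kkkk, +kkkk, T+kkkk, U+kkkk
# Prefix letters and hexadecimal letters may be capital or small.
_IDENTIFIER = re.compile(
    r"(?:(?P<l8>[TU])?-)?(?P<h8>[0-9A-F]{8})"
    r"|(?:(?P<l4>[TU])?\+)?(?P<h4>[0-9A-F]{4})",
    re.IGNORECASE,
)


def parse_ucs_identifier(text: str):
    """Return (level, code_position); level is 'T', 'U' or None."""
    match = _IDENTIFIER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a UCS identifier: {text!r}")
    letter = match.group("l8") or match.group("l4")
    digits = match.group("h8") or match.group("h4")
    return (letter.upper() if letter else None, int(digits, 16))


def amendment_level_matters(code_position: int) -> bool:
    """True if the T and U forms can differ (range 3400 to 4DFF inclusive)."""
    return 0x3400 <= code_position <= 0x4DFF
```

For example, `parse_ucs_identifier("u+ac19")` yields `("U", 0xAC19)`, and `amendment_level_matters(0x340F)` is true, so an identifier without a prefix letter for that position would be ambiguous as to amendment level.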
6.5 Unique identifiers for glyphs
The unique identifiers described above for characters are based on the International Standard ISO/IEC 10646. There is also an internationally agreed assignment of unique identifiers to glyphs, but this is instead based on an International Registration Authority. The registrar is the Association for Font Information Interchange and the register operates under procedures laid down in ISO/IEC 10036.
Glyphs registered under ISO/IEC 10036 are assigned an identifier by the Registration Authority that is a hexadecimal number in the range from 0 to FFFFFFFF. This is the same range of values as that used for identifiers of characters in accordance with ISO/IEC 10646. For the characters of ASCII the same value has been assigned to one possible glyph for each character as is assigned to the character in the ASCII code, and therefore as also in the UCS. For example, the character LATIN CAPITAL LETTER A has the character identifier U+0041 and is represented by the glyph “A” which has the glyph identifier 41 (hexadecimal). However, certain characters of the ASCII code have had their interpretation refined as coded character sets have developed over time. This has led to departures from a strict correspondence even for the ASCII code. In particular:
- U+0060 is the character GRAVE ACCENT but glyph identifier 0060 is a left single quotation mark (the glyph identifier for a grave accent is 00C1).
The use of code positions 27 (U+0027 is APOSTROPHE) and 60 for right and left single quotation marks was an allowed alternative in the original ASCII code. The glyph for a right single quotation mark is acceptable also for an apostrophe, but that for a left single quotation mark is not acceptable as a grave accent. These ASCII alternatives are still present in the registration entry under ISO 2375 for the ASCII code, namely ISO IR-6 in the International Register of Coded Character Sets to be used with Escape Sequences, as this entry dates from 1975. Register entries, once made, cannot be revised (other than in exceptional circumstances and if the possibility of revision was stated in the original entry). However, these alternatives are not present in the international standard equivalent to ASCII, namely the International Reference Version (IRV) of ISO/IEC 646:1991. Nevertheless, that standard states explicitly that its IRV may be identified as ISO IR-6.
For use in a wider context, ISO/IEC 9541-1 specifies a structured-name form for the identification of glyphs registered under ISO/IEC 10036. These have the form
- ISO/IEC 10036/RA/Glyphs::nnnn
where nnnn is a sequence of decimal digits, beginning with a non-zero digit, which represents the hexadecimal value of the glyph identifier assigned by the Registration Authority. The concept of a structured-name is specified normatively in ISO/IEC 9541-2, which gives both ASN.1 and SGML forms for such names.
7 Repertoires and subsets
There is really only one concept of a repertoire, namely a repertoire is a specified set of characters. However, the concept is defined slightly differently in different character set standards and it is interpreted in ways that may differ from one’s expectations. Two particular definitions are
- repertoire: A specified set of characters that are represented by one or more bit combinations of a coded character set. [ISO/IEC 6937]
- repertoire: A specified set of characters that are represented in a coded character set. [ISO/IEC 10646-1]
For completeness, a coded character set also has a formal definition
- coded character set: A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations. [ISO/IEC 6937]
- coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. [ISO/IEC 10646-1]
It is instructive to see how these two standards differ in their use of the concept of repertoire. Recall that ISO/IEC 6937 is a standard that bases a variable-length encoding of characters from the Latin script on forming combinations of non-spacing diacritical marks with unaccented letters. It is based on two separate 7-bit coded character sets that are separately registered under ISO 2375. The primary set of ISO/IEC 6937 is the left-hand set, coded in an 8-bit code as 20 to 7E. This is precisely the ASCII set registered as ISO-IR 6. The supplementary set of ISO/IEC 6937 is the right-hand set, coded as A0 to FF, which contains both the non-spacing diacritical marks and other (spacing) characters.
The repertoire of ISO/IEC 6937 is specified separately, as a list of characters together with their (variable length) coded representations. It consists of 333 characters, including SPACE. Its characters include the accented characters that are coded by two octets, the first representing a non-spacing diacritical mark from the supplementary set and the second representing an unaccented letter from the primary set. The repertoire of ISO/IEC 6937 does not include the non-spacing diacritical marks as characters in their own right.
This is entirely consistent with the definition of a repertoire. The repertoire of ISO/IEC 6937 is established by that standard as a specific list of characters, each of which is represented by one or more bit combinations. It is quite separate from the union of the repertoires of the primary and supplementary sets of ISO/IEC 6937, which consists of the 191 characters, including SPACE, each coded by one octet. That repertoire does include, say, NON-SPACING ACUTE ACCENT, but it does not include LATIN SMALL LETTER E WITH ACUTE, while the repertoire of ISO/IEC 6937 includes the latter character but not the former one.
The concept of repertoire as used in ISO/IEC 10646 corresponds in the context of ISO/IEC 6937 to that of the union of its primary and supplementary sets, not to that of ISO/IEC 6937 itself. The repertoire of ISO/IEC 10646 consists of the characters that are assigned to code positions within the 31-bit coding space of the UCS. It therefore includes combining characters (which are the nearest equivalent in ISO/IEC 10646 to the non-spacing diacritical marks of ISO/IEC 6937) but does not include either composite sequences or characters, such as LATIN SMALL LETTER G WITH GRAVE, which have glyphs that can be represented by composite sequences.
There is a faint indication of this difference in the definitions given in these two standards. In ISO/IEC 6937 the definition refers to characters represented by bit combinations; in ISO/IEC 10646-1 it refers to characters represented in the coded character set. There is no conflict, since it is the definition of a coded character set that is crucial. A coded character set is first required to establish a character set, before it assigns coding. That character set is then the repertoire of the coded character set. A repertoire, composed of characters, is therefore whatever the relevant standard says it is. It is, in principle, quite distinct from the set of glyphs that may be represented by the characters of the repertoire. For many purposes it is this set of glyphs that is relevant, not the set of characters used to represent them. But describing or specifying this set of glyphs is outside of the scope of standards for coded character sets.
7.2 Levels of implementation of the UCS
There are three levels of implementation specified in ISO/IEC 10646, distinguished from one another by limitations on the characters that may be encoded at the level concerned. They are as follows:
- Implementation level 1
- At level 1, the prohibited characters are those from the HANGUL JAMO block and all combining characters.
- Implementation level 2
- At level 2, the prohibited characters are those from the HANGUL JAMO block and a specific subset of combining characters listed in annex B of the standard.
- Implementation level 3
- At level 3, there are no restrictions on the characters that may be used.
Hangul Jamo characters are used in the Hangul syllable composition method. A sequence of two or three Hangul Jamo characters has a glyph that represents a syllable. Hangul syllables also have precomposed coding in the HANGUL EXTENDED block of the O-zone of the BMP. The relationship between coding in terms of Hangul Jamo and that as a single syllabic character is similar to that between the precomposed and decomposed forms of Latin characters with diacritical marks. However, there is no distinction for the Hangul Jamo characters corresponding to that between the non-combining and combining characters of a composite sequence. No Hangul Jamo characters have a meaning in isolation within the Hangul script. For this reason it is specifically stated that the characters of the HANGUL JAMO block are not combining characters. Note that the Hangul syllabic characters of the HANGUL EXTENDED block are permitted at all levels of implementation.
The chapter on visual representation of characters gives more information about the scripts that can be represented at the different levels of implementation.
7.3 Collections and subsets
A collection of characters consists of the characters of the UCS that are allocated to code positions lying within one of the ranges specified for this purpose in annex A of ISO/IEC 10646-1. Each collection is assigned both a number and a name. There is a collection associated with, and frequently identical to, each block into which the BMP is divided. These collections, together with their names and numbers, are listed in the chapter of this guide on the Basic Multilingual Plane (BMP). It should be noted that, as a collection is defined by a range, it may include code positions which have not been assigned characters. An amendment to the standard may allocate characters to such code points. Thus the repertoire defined in a collection may change over time. This is not always desirable, so the notion of a fixed collection was introduced in Corrigendum No.2. As a consequence the definition of a fixed collection has to be much more precise in that no range can contain unassigned code points.
Two different collections of characters may overlap, but of those associated with specific blocks the only overlap is that two of the four characters comprising the collection ZERO-WIDTH BOUNDARY INDICATORS are also present in collections for a number of specific scripts. A number of other specialized collections are defined in annex A which put together selections of characters that are also present in other collections. These consist of script-specific formatting characters and alternate forms. There are also two collections related to the permitted levels of implementation. One consists of all combining characters and the other of those combining characters that are not permitted in an implementation at level 2. Finally there are five large collections (two of which are fixed collections) defined as follows:
Table 9 : Large collections of the UCS
Number | Name | Code positions
299 | BMP FIRST EDITION (Note: a fixed collection containing only characters contained in the first edition prior to any amendments.) | See ISO/IEC 10646-1 A.3
300 | BMP | 0000-D7FF, E000-FFFD
301 | BMP-AMD.7 (Note: a fixed collection containing those characters of the first edition as amended by amendments 1 to 7.) | See ISO/IEC 10646-1 A.3
400 | PRIVATE USE PLANES | G=00, P=0F, 10, E0-FF
500 | PRIVATE USE GROUPS | G=60-7F
The specifications of collections 300 and 400 were changed by Amendment 1 consequent on the introduction of the S-zone and its reservation for the use of UCS Transformation Format 16.
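Because a collection is defined by ranges of code positions, membership can be tested with simple comparisons. The Python sketch below covers only the three range-defined large collections of Table 9; the function names are assumptions made for this example, and the fixed collections 299 and 301 are omitted because their contents are given by reference to A.3 of the standard.

```python
def in_collection_300(code_position: int) -> bool:
    """BMP: 0000-D7FF and E000-FFFD (the S-zone D800-DFFF is excluded)."""
    return 0x0000 <= code_position <= 0xD7FF or 0xE000 <= code_position <= 0xFFFD


def in_collection_400(code_position: int) -> bool:
    """PRIVATE USE PLANES: G=00 with P=0F, 10 or E0-FF."""
    group = (code_position >> 24) & 0xFF
    plane = (code_position >> 16) & 0xFF
    return group == 0x00 and (plane in (0x0F, 0x10) or 0xE0 <= plane <= 0xFF)


def in_collection_500(code_position: int) -> bool:
    """PRIVATE USE GROUPS: G=60-7F."""
    return 0x60 <= ((code_position >> 24) & 0xFF) <= 0x7F
```

Note that such a test says only that a code position lies within the collection's ranges; as explained above, a position inside a non-fixed collection may not yet have a character assigned to it.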
A subset is a more general term that refers to any identified set of characters from the entire repertoire of the UCS. Two alternative means of specifying subsets are recognized within ISO/IEC 10646-1:
- Limited subset
- A limited subset is specified by giving explicitly a list of the graphic characters in the subset. They may be listed by their names or their code positions in the UCS.
- Selected subset
- A selected subset is specified as a list of collections. A selected subset shall always include the BASIC LATIN collection.
A selected subset is more restricted than a limited subset in its permitted content, but it has two great advantages. It is much more concise to list collections rather than individual characters. Also, annex M of ISO/IEC 10646-1 specifies by algorithm an ASN.1 object identifier that may be used to identify a selected subset of the UCS within any context in which OSI protocols are used.
A limited subset may be assigned an ASN.1 object identifier, but only by means outside the scope of ISO/IEC 10646-1. The following European pre-standard:
- ENV 1973:1995 Information technology – European Subsets of ISO/IEC 10646-1
contains the definition of a limited subset (the Minimum European Subset) and assigns an ASN.1 object identifier to it. It also describes a selected subset (the Extended European Subset) that has an ASN.1 object identifier assigned in accordance with the algorithm of ISO/IEC 10646-1.
7.4 Significance of subsets for conformance to the UCS
Because of the size and open-ended nature of the repertoire of the UCS, conformance to ISO/IEC 10646-1 does not require the ability to handle all of the characters in the repertoire. Instead, a claim of conformance for information interchange is required to identify:
- a specific method of coding (see the chapter on coding methods of the UCS);
- a specific subset of characters (see above);
- a specific level of implementation (see above).
A separate definition of conformance is given for conformance of a device. For this purpose a device is a component of information processing equipment which can transmit and/or receive coded information, such as an input/output device, an application program or a gateway function. A claim of conformance for a device is required to specify the above three items and in addition
- a specific selection of control functions that are used in conjunction with the UCS (see the chapter on control functions).
The precise meaning of conformance is specified in ISO/IEC 10646 and will not be reproduced here. The important aspect here is that conformance only requires support of the UCS within the limits determined by these specified items.
7.5 Subsets as an aid to migration from 8-bit codes
The ability to conform to ISO/IEC 10646 while supporting only a subset of its characters is a great aid to migration from other coded character sets. In particular it permits support to be developed collection by collection. It is only in a few cases that there is a direct correspondence between the collections defined in ISO/IEC 10646-1 and the repertoires of other standardized coded character sets. However, expanding support one collection at a time substantially eases the effort required, such as the development of glyphs for additional characters.
The assignment by ISO/IEC 10646 of an ASN.1 object identifier for any selected subset provides a means within OSI protocols for an application to notify its peer, in any communication, of the collections that it supports. The Extended European Subset (EES) specified in ENV 1973 consists of the collections numbered 1-11, 27-28, 30-48, 63, 65 and 70. These contain 4013 code positions, of which 3095 are currently assigned to characters. These are all the collections that contain characters of the Latin, Greek, Cyrillic, Armenian and Georgian scripts together with other characters of the International Phonetic Alphabet and a wide range of symbols used for academic, commercial and scientific purposes within Europe. This subset is defined as guidance for product developers, but it in no way restricts the ability of any developer to extend support to either a smaller or a larger range of collections than that of the EES.
8 Coding methods of the UCS
At present there are four coding methods specified within ISO/IEC 10646-1, any one of which can be specified in a claim of conformance to that standard. These methods have been assigned acronyms for easy reference, as follows:
- UCS-2 is the Two-octet BMP form of coding;
- UCS-4 is the Four-octet canonical form of coding;
- UTF-16 is UCS Transformation Format 16, which was added to ISO/IEC 10646-1 by Amendment 1;
- UTF-8 is UCS Transformation Format 8, which was added to ISO/IEC 10646-1 by Amendment 2.
The first edition of ISO/IEC 10646-1 contained a specification of a transformation format UTF-1 but this was deleted from the standard by Amendment 4 and is not available as a coding method in a claim of conformance to ISO/IEC 10646-1.
The two-octet BMP form of coding permits the use of characters from the BMP with each character represented by two octets. The BMP is specified by the G-octet and P-octet both being 00. In this form of coding a character is represented by the R-octet and C-octet of its code position. When expressed as a four-digit hexadecimal number the R-octet gives the most significant two digits and the C-octet gives the least significant two digits. UCS-2 provides a fixed-length coding for all the characters of the BMP.
8.3 UCS-4: Four-octet canonical form
The four-octet canonical form permits the use of all characters of ISO/IEC 10646 with each character represented by the G-octet, P-octet, R-octet and C-octet of its code position. These are taken in decreasing order of significance in the expression of a code position as an eight-digit hexadecimal number. UCS-4 provides a fixed-length coding for all the characters of the UCS.
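The two fixed-length forms can be illustrated by splitting a code position into its G-, P-, R- and C-octets. The following Python sketch is illustrative only; the function names are assumptions, and octets are emitted in decreasing order of significance.

```python
def gprc_octets(code_position: int):
    """Split a UCS code position into its (G, P, R, C) octets."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit coding space of the UCS")
    return (
        (code_position >> 24) & 0xFF,  # G-octet (group)
        (code_position >> 16) & 0xFF,  # P-octet (plane)
        (code_position >> 8) & 0xFF,   # R-octet (row)
        code_position & 0xFF,          # C-octet (cell)
    )


def encode_ucs4(code_position: int) -> bytes:
    """Four-octet canonical form: G, P, R, C."""
    return bytes(gprc_octets(code_position))


def encode_ucs2(code_position: int) -> bytes:
    """Two-octet BMP form: R, C; only valid when G and P are both 00."""
    g, p, r, c = gprc_octets(code_position)
    if g != 0 or p != 0:
        raise ValueError("UCS-2 can represent only characters of the BMP")
    return bytes((r, c))
```

For LATIN CAPITAL LETTER A (code position 00000041), `encode_ucs2` gives the octets 00 41 and `encode_ucs4` gives 00 00 00 41.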
8.4 UTF-16: UCS Transformation format 16
Once characters start to be allocated outside of the BMP, it will no longer be possible to use UCS-2 to encode all the allocations that have been made. However, a transition to UCS-4 instantly halves the rate of transfer of data through a communication link or the amount that can be stored on a given storage medium. This effect occurs even if the transition to UCS-4 has been made in order to accommodate only very occasional characters coded outside of the BMP.
The transformation format UTF-16 has been designed to avoid this halving of capacity, by means of a variable-length coding. It provides a means of coding any character within the first 17 planes P=00-10 of Group 00 such that the coding of any character within the BMP (Plane 00) is unchanged from its UCS-2 form. This multiplies the number of available code positions by 17 when compared with the BMP, but the number of octets used for coding is increased only for the (occasional) characters that are allocated to the planes outside the BMP. The capacity of a transmission link or storage device will therefore be little affected.
This has been achieved by reserving the S-zone of the BMP, consisting of the 8 rows D8-DF, for the exclusive use of UTF-16. These R-octet values can therefore never occur in an encoding within UCS-2. They are used instead to provide an escape mechanism into the 16 planes G=00, P=01-10. Amendment 1, which specifies UTF-16, also amended the planes of Group 00 which are reserved for private use. It added planes P=0F, 10 to the planes P=E0-FF which were already reserved for this purpose in the first edition of ISO/IEC 10646-1. The effect of this change is to include two private use planes in the 16 additional planes accessible by use of UTF-16.
The UTF-16 coding for a character coded within Planes 00-10 of Group 00 is constructed as follows:
- If P=00 then the coding is in two octets and is as for UCS-2, i.e. the R-octet followed (i.e. with lower significance) by the C-octet;
- If P=01-10 then the coding is in four octets constructed from the UCS-2 coding of two code positions from the S-zone. The first (most significant) two octets are from the range D800-DBFF, the second (least significant) two octets are from the range DC00-DFFF. The code space P=01-10 is divided into blocks of 400 (hexadecimal value) cells for the purpose of determining the coding. The first two octets determine in which block the code position lies. The second two octets determine the position within the block.
In more detail the correspondences between the UCS code position and the pair of S-zone positions is as follows:
- The first two octets are D800 if P=01, R=00-03; D801 if P=01, R=04-07; ...; D840 if P=02, R=00-03; ...; DBFF if P=10, R=FC-FF.
- The second two octets run from DC00 to DFFF as the position within the block of cells runs from the first to the last position.
The UTF-16 encoding D800 DC00 and the UCS-4 encoding 00010000 therefore represent the same character.
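The construction described above can be sketched in Python as follows. The function name `encode_utf16_words` is an assumption made for this example, and the result is returned as a list of 16-bit values rather than as serialized octets.

```python
def encode_utf16_words(code_position: int):
    """Return the one or two 16-bit values that code a position in UTF-16."""
    if 0xD800 <= code_position <= 0xDFFF:
        raise ValueError("S-zone positions are reserved for UTF-16 itself")
    if 0 <= code_position <= 0xFFFF:
        # P=00: identical to the UCS-2 form.
        return [code_position]
    if code_position > 0x10FFFF or code_position < 0:
        raise ValueError("only planes 00-10 of Group 00 can be coded")
    # P=01-10: count from the start of plane 01 and split into
    # a block number (blocks of 400 hex cells) and a position in the block.
    offset = code_position - 0x10000
    block = offset >> 10          # 0000 to 03FF
    position = offset & 0x3FF     # 0000 to 03FF
    return [0xD800 + block, 0xDC00 + position]
```

With this sketch, code position 00010000 gives D800 DC00, the first position of plane 02 gives D840 DC00, and the last position of plane 10 gives DBFF DFFF.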
8.5 UTF-8: UCS Transformation format 8
The aim of UTF-8 is entirely different from that of UTF-16. The transformation format UTF-8 is intended for the transmission of data through communication systems which treat the data stream as a sequence of octets from a coding system conforming to the 8-bit code structure laid down in ISO/IEC 4873. This code structure is specific as to the interpretation of octet values in the range 00-7F but octets in the range 80-FF have a variable interpretation that requires agreement between the communicating parties. A communication channel expecting data to conform to ISO/IEC 4873 may therefore only presume to know the interpretation of octets 00-7F.
In particular the communication system may interpret any octet in the range 00-1F as a control character as specified in ISO/IEC 4873, any octet in the range 20-7E as the ASCII character with this coding, and octet 7F as the DELETE character. To comply with this, UTF-8 encodes BMP code positions 0000 – 007F inclusive by means of their final octet only. This range of positions includes those reserved for control characters and the DELETE function and it relies on the positioning of the ASCII graphic character set in positions 0020-007E of the BMP.
All other code positions in the UCS are represented in UTF-8 by a sequence of 2, 3, 4, 5 or 6 octets. The first octet of such a sequence is in the range C0-FD. Continuing octets are in the range 80-BF. Octets FE and FF are not used.
There is no concept of most significant and least significant octets in UTF-8 encoding. It is a conversion of UCS characters into an ordered sequence of octets, for transmission or other processing in this form. The terms first octet and continuing octets refer to the order in which the octets occur in the sequence. This order must be maintained, even in transmission systems which serialize 16-bit words as octets by sending the least significant octet before the most significant octet.
The details of the transformation from UCS-4 coding to UTF-8 coding are complex and are not reproduced here. The transformation is such that a code position within the BMP takes at most 3 octets and a code position in planes P=01-1F of Group 00 takes at most 4 octets. These positions that take a maximum of 4 octets to encode therefore include, and exceed, those that can be encoded within UTF-16.
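Although the normative details are left to the standard, the general shape of the transformation can be sketched. The following Python example is illustrative only and follows the commonly documented bit layout of the original six-octet UTF-8 scheme, which is assumed rather than taken from this guide; the function name `encode_utf8_octets` is likewise an assumption.

```python
# (number of octets, exclusive upper limit of code positions, first-octet marker)
_FORMS = (
    (2, 0x800, 0xC0),
    (3, 0x10000, 0xE0),
    (4, 0x200000, 0xF0),
    (5, 0x4000000, 0xF8),
    (6, 0x80000000, 0xFC),
)


def encode_utf8_octets(code_position: int) -> bytes:
    """Encode a UCS code position as a sequence of 1 to 6 octets."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit coding space of the UCS")
    if code_position < 0x80:
        return bytes([code_position])  # 0000-007F: final octet only
    for count, limit, marker in _FORMS:
        if code_position < limit:
            break
    octets = []
    value = code_position
    for _ in range(count - 1):
        octets.append(0x80 | (value & 0x3F))  # continuing octet, 80-BF
        value >>= 6
    octets.append(marker | value)             # first octet, C0-FD
    return bytes(reversed(octets))
```

Under these assumptions a BMP position such as FEFF takes three octets (EF BB BF) and position 00010000 takes four (F0 90 80 80), consistent with the limits stated above.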
9 Serial transmission of the UCS
For many purposes, character data encoded according to any of the encoding methods of the UCS will need to be transmitted in serial form as a sequence of octets. Each of the encoding methods of the UCS, other than UTF-8, specifies its encodings in terms of octets ranked from a most significant octet to a least significant octet. This corresponds to the representation of the code value as a hexadecimal number of 2, 4, 6, 8 or more hexadecimal digits. This description in terms of orders of significance is entirely separate from the order in which the octets are transmitted down a serial communication channel. UTF-8 coding is distinct in that it is directly a coding of UCS data as an ordered sequence of octets.
It is laid down in 6.3 of ISO/IEC 10646-1 that in any transmission method the sequence order of octets from most to least significant shall be preserved and the most significant and least significant ends of the sequence must be identifiable. Furthermore, when character data is serialized as octets then a more significant octet shall precede a less significant octet. When not serialized as octets, the order of octets is a matter of agreement between sender and recipient. For example, if UCS-4 encoding is transmitted along a 16-bit data path then the most and least significant 16-bit words must be composed respectively of the G and P octets and of the R and C octets but it is a matter of private agreement as to the ordering of the two octets within a 16-bit word. The use of signatures for coding identification, explained below, enables a receiving implementation to determine the octet ordering within a 16-bit word in such cases.
A serial data stream of octets may be transmitted through a data path that is 8 bits wide, but it may also be transmitted in turn as a serial stream of single bits. Such representation of the octet stream as a stream of single bits is outside the scope of ISO/IEC 10646-1, so there is no requirement placed by that standard on the order in which the individual bits of an octet are transmitted. The requirements concerning octet order apply at a higher level of protocol at which the data exists in the form of complete octets.
9.2 Signatures for coding identification
There is one character of the BMP that is not part of any of the blocks into which the BMP is otherwise divided. This is the character
- ZERO WIDTH NO-BREAK SPACE (U-0000FEFF)
It is not normally required for linguistic purposes and is not present in any of the collections associated with particular scripts. It is given a special significance, by a convention laid down in annex F of ISO/IEC 10646-1, that enables a receiving implementation to determine without prior knowledge what octet ordering is being adopted by the originating implementation. The coded form of this character, in each of the coding methods of the UCS, is known as the signature of that coding method.
Under this convention, an originating implementation sends the character ZERO WIDTH NO-BREAK SPACE at the beginning of a stream of characters. A receiving implementation that is unaware of the convention may simply ignore this character as it has no semantic meaning when sent as an initial character. But an implementation aware of the convention may use the received coded form to determine the ordering of octets being used by the sender.
The coded forms that may be received are:
- UCS-2 signature: FEFF
- UCS-4 signature: 0000 FEFF
- UTF-16 signature: FEFF
- UTF-8 signature: EF BB BF
The UTF-8 coding method is a special case, in that this method in itself specifies an octet ordering. The octets of the UTF-8 signature should therefore never be received in any other order. For the other three coded forms, if the data received is interpreted as containing the 16-bit word FFFE then it indicates that the order of octets within the word should be reversed. Recall that FFFE is prohibited from allocation to a character within any plane of the UCS, specifically to allow this method of signatures to operate without any possible ambiguity. If the word value FEFF (or FFFE) is preceded by the word value 0000 then it also serves to indicate that UCS-4 coding is being used. It is not possible to distinguish between UCS-2 and UTF-16 coding by signature alone, as they give identical encodings for every character of the BMP.
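The signature convention described above might be applied by a receiver along the following lines. This is a sketch only: the function name `detect_signature` and the wording of its return values are assumptions made for illustration, and it models only the reordering of octets within a 16-bit word that this clause discusses.

```python
def detect_signature(data: bytes):
    """Guess the coding method and octet order from a leading signature.

    Returns (coding, octet_order) or None if no signature is present.
    """
    if data[:3] == b"\xEF\xBB\xBF":
        return ("UTF-8", "fixed by the coding method")
    # Check the four-octet forms first: the UCS-4 signature is the word
    # FEFF (or FFFE) preceded by the word 0000.
    if data[:4] == b"\x00\x00\xFE\xFF":
        return ("UCS-4", "as received")
    if data[:4] == b"\x00\x00\xFF\xFE":
        return ("UCS-4", "reverse octets within each 16-bit word")
    # UCS-2 and UTF-16 cannot be told apart by signature alone.
    if data[:2] == b"\xFE\xFF":
        return ("UCS-2 or UTF-16", "as received")
    if data[:2] == b"\xFF\xFE":
        return ("UCS-2 or UTF-16", "reverse octets within each 16-bit word")
    return None
```

A receiver that gets no match would have to rely on prior agreement with the sender, as before.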
10 Use of control functions with the UCS
10.1 The coding of control functions in 7-bit and 8-bit codes
The general structure of 7-bit and 8-bit codes for character sets is given in:
- ISO/IEC 2022 Information technology – Character code structure and extension techniques.
In particular, this standard provides the overall rules governing the use of control functions with such character sets. When control functions are used in conjunction with the UCS, their coded representation is algorithmically derived from their ISO/IEC 2022 representations. In the first edition of ISO/IEC 10646-1 this algorithm was based exclusively on the ISO/IEC 2022 representation for a 7-bit code, but Amendment 3 changed this to permit also coded representation of control functions based on their 8-bit code representation.
The following definitions are taken from ISO/IEC 2022:
- control character: A control function the coded representation of which consists of a single bit combination.
- control function: An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more bit combinations.
In this context a bit combination is either a 7-bit or an 8-bit byte. In a 7-bit code the hexadecimal values 00-1F are reserved for control characters; this is known as the CL area of the code table. In an 8-bit code there are two such reserved areas, the CL area comprising values 00-1F and the CR area comprising 80-9F.
The coding of any control function consists either of a single control character from the CL or CR area, or of such a character followed by one or more bit combinations. These following bit combinations are unrestricted in number and value. In particular, they may be (and usually are) values that on their own would be coded representations of graphic characters. The syntax for the use of any control character must provide a way of determining the end of the coding of each control function coded with its use.
ISO/IEC 2022 specifies the assignment of only one control character:
- ESCAPE (acronym ESC) is required to be coded at 1B in the CL area.
The semantics specified for this control character are simple – it causes the meaning of a limited number of bytes following it to be changed. A sequence consisting of the ESCAPE character and such following bytes is known as an escape sequence.
ISO/IEC 2022 requires an escape sequence to have the following structure:
- The first character is the ESCAPE character represented by byte value 1B;
- the last character is a Final Byte with value in the range 30-7E;
- between the first and last character there may be zero, one or more Intermediate Bytes, each with a value in the range 20-2F.
This structure enables the escape sequence to be delimited without knowledge of the semantics of the control functions represented by such sequences. ISO/IEC 2022 also specifies a more detailed structure for the use of escape sequences. These uses are all for code extension purposes, such as changing the sets of control characters or graphic characters that are in use, but that standard does not specify the semantics of individual escape sequences. These are subject to registration in accordance with:
- ISO 2375 Data processing – Procedure for registration of escape sequences.
The register set up in accordance with ISO 2375 is:
- International Register of Coded Character Sets to be used with Escape Sequences.
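The escape sequence structure given earlier in this clause (ESCAPE, then any Intermediate Bytes in 20-2F, then a Final Byte in 30-7E) is enough to delimit a sequence without knowing what it means. A minimal sketch follows; the function name `read_escape_sequence` is hypothetical.

```python
ESC = 0x1B


def read_escape_sequence(data: bytes, start: int = 0) -> bytes:
    """Return the complete escape sequence beginning at data[start]."""
    if start >= len(data) or data[start] != ESC:
        raise ValueError("no ESCAPE character at this position")
    i = start + 1
    # Skip zero or more Intermediate Bytes (20-2F).
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # The sequence ends at the first Final Byte (30-7E).
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return data[start:i + 1]
    raise ValueError("escape sequence is not terminated by a Final Byte")
```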
10.2 C0 and C1 sets of control characters
The ISO 2375 register includes escape sequences that invoke individual control functions, but it also includes escape sequences that invoke sets of control characters. Each such set is either a C0-set or a C1-set. One set of each type may be invoked simultaneously in either a 7-bit or an 8-bit code. In both cases a C0-set is mapped into the CL area of the code table. A C1-set may be mapped into the CR area of an 8-bit code, but an alternative coding is necessary in a 7-bit code and this alternative is also available for use with 8-bit codes.
This alternative coding of a C1-set is by means of escape sequences. Control characters that would be coded as values in the CR area are represented by a two-byte escape sequence consisting of ESCAPE followed by a Final Byte in the range 40-5F. These values 40-5F correspond sequentially to the values 80-9F of the CR area. This coding is known as the ESC Fe representation of the control character. This term is used in the first edition of ISO/IEC 10646-1 in its explanation of the coding of control functions, but the permitted coding of control functions was extended by Amendment 3 and references to ESC Fe sequences have been deleted.
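The correspondence between the CR area and the ESC Fe representation is a fixed offset of hexadecimal 40, which a short sketch can make concrete. The function names are illustrative and not taken from any standard.

```python
ESC = 0x1B


def c1_to_esc_fe(value: int) -> bytes:
    """Map a CR-area value (80-9F) to its two-byte ESC Fe representation."""
    if not 0x80 <= value <= 0x9F:
        raise ValueError("not a value in the CR area")
    return bytes([ESC, value - 0x40])


def esc_fe_to_c1(sequence: bytes) -> int:
    """Map an ESC Fe sequence (ESC followed by 40-5F) back to 80-9F."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not an ESC Fe sequence")
    return sequence[1] + 0x40
```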
Control functions for purposes other than code extension are specified in
- ISO/IEC 6429 Information technology – Control functions for coded character sets.
This standard provides a repertoire of control functions from which particular functions may be selected to meet particular needs. Conformance to ISO/IEC 6429 does not require any particular control functions to be supported. It does, however, require that when any of its control functions are used then they shall be encoded as specified in that standard, and that encodings so specified shall not be used for any other purpose.
A particular type of coding specified in ISO/IEC 6429 is a control sequence. Control sequences are similar in nature to the escape sequences of ISO/IEC 2022 but they are less restricted in their use. In particular they permit the coding of control functions that take one or more numeric values as parameters. Control sequences start with the C1 control character
- CONTROL SEQUENCE INTRODUCER (acronym CSI)
which may be coded either as 9B in the CR area or as the ESC Fe escape sequence ESC 5B. This character occupies the same relative location in a C1 set as does the ESCAPE character (1B) in a C0 set. A control sequence has the following structure:
- It starts with the CSI function in either of its two coded representations;
- the last character is a Final Byte with value in the range 40-7E;
- between the first and last character there may be zero, one or more Parameter Bytes each with a value in the range 30-3F, followed by zero, one or more Intermediate Bytes, each with a value in the range 20-2F.
The Parameter Bytes used to represent a sequence of numeric parameters consist of the decimal representation of those numbers, with digits 0-9 coded by hexadecimal values 30-39, any separator symbol within a parameter (e.g. decimal point) coded as 3A and with distinct parameters separated from one another by value 3B.
The Final Byte value 6D is reserved in ISO/IEC 6429 for use by ISO/IEC 10646-1, in which it is used to encode the control function IDENTIFY UNIVERSAL CHARACTER SUBSET.
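The structure of a control sequence described in this clause can be taken apart mechanically. The following is a minimal sketch, assuming the whole sequence is already available as 8-bit bytes; the function name `parse_control_sequence` is hypothetical. Parameters are returned as strings rather than integers, because a parameter may itself contain the separator 3A.

```python
def parse_control_sequence(data: bytes):
    """Split a control sequence into (parameters, intermediates, final byte)."""
    if data[:1] == b"\x9B":            # CSI coded in the CR area
        i = 1
    elif data[:2] == b"\x1B\x5B":      # CSI coded as ESC Fe
        i = 2
    else:
        raise ValueError("does not start with CSI")
    start = i
    while i < len(data) and 0x30 <= data[i] <= 0x3F:   # Parameter Bytes
        i += 1
    parameter_bytes = data[start:i]
    start = i
    while i < len(data) and 0x20 <= data[i] <= 0x2F:   # Intermediate Bytes
        i += 1
    intermediates = data[start:i]
    if i >= len(data) or not 0x40 <= data[i] <= 0x7E:  # Final Byte
        raise ValueError("missing Final Byte")
    # Distinct parameters are separated by 3B, which is ";" as a character.
    text = parameter_bytes.decode("ascii")
    parameters = text.split(";") if text else []
    return parameters, intermediates, data[i]
```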
10.3 The use of control functions with the UCS
The coded representation of a control function for use with the UCS is obtained very simply from its coded representation for use with an 8-bit code. For an 8-bit code the coded representation consists of a sequence of one or more 8-bit bytes, of which the first is in one of the ranges 00-1F or 80-9F. Each of these bytes is converted to a UCS code position by taking the byte value as the C-octet value and setting the G-octet, P-octet and R-octet all to zero. Each UCS code position is then encoded according to the adopted coding method of the UCS, that is, UCS-2, UCS-4, UTF-8 or UTF-16. Since code positions 0000-001F and 0080-009F of the BMP are reserved for the use of control characters, this gives an unambiguous coded representation of any control function without conflicting with the coding of graphic characters in the UCS.
In the first edition of ISO/IEC 10646-1 there was an additional requirement that control characters from a C1-set should be represented as ESC Fe escape sequences before this algorithm is applied. This requirement was removed by Amendment 3. As an example the control function CONTROL SEQUENCE INTRODUCER, which has coded representation 9B in the CR area of an 8-bit code, would have coded representations in a UCS-2 coding of
- 001B 005B prior to Amendment 3;
- either 001B 005B or more simply 009B after Amendment 3.
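The transformation described in this clause can be sketched as follows. Two assumptions are made for illustration: the function name `control_function_to_ucs` is hypothetical, and Python's built-in big-endian codecs are used as stand-ins for the UCS coding methods, which is adequate here because every code position involved lies in the range 0000-00FF.

```python
# Python codec names standing in for the UCS coding methods (an assumption).
CODECS = {
    "UCS-2": "utf-16-be",
    "UTF-16": "utf-16-be",
    "UCS-4": "utf-32-be",
    "UTF-8": "utf-8",
}


def control_function_to_ucs(octets: bytes, method: str) -> bytes:
    """Convert the 8-bit coded form of a control function to a UCS coding."""
    # Each byte value becomes the C-octet of a code position whose G-, P-
    # and R-octets are zero; decoding as Latin-1 performs exactly that mapping.
    code_positions = octets.decode("latin-1")
    return code_positions.encode(CODECS[method])


print(control_function_to_ucs(b"\x1B\x5B", "UCS-2").hex())  # ESC Fe form of CSI
print(control_function_to_ucs(b"\x9B", "UCS-2").hex())      # CR-area form of CSI
```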
10.4 Identification of UCS subsets by use of control functions
One particular control function is defined within ISO/IEC 10646-1 for identifying selected subsets of the UCS. This is coded by means of an ISO/IEC 6429 control sequence, using a Final Byte value reserved in that standard for the use of ISO/IEC 10646-1. The control function is
- IDENTIFY UNIVERSAL CHARACTER SUBSET (acronym IUCS)
It takes as parameters the collection numbers of the collections that comprise the selected subset. The ISO/IEC 6429 coded representation of this control function consists of:
- the control function CONTROL SEQUENCE INTRODUCER (CSI)
- followed by parameter bytes that encode the sequence of collection numbers
- followed by the Intermediate Byte 20
- followed by the (reserved) Final Byte 6D.
This coded representation is then converted to the form appropriate to the UCS coding method in use, following the transformation rules explained above.
As an example,
- the subset comprising the BASIC LATIN and LATIN-1 SUPPLEMENT collections
- which are collections numbered 1 and 2
- may be identified by the control function IUCS 1, 2
- which has ISO/IEC 6429 coded representation CSI 31 3B 32 20 6D
- and which has UCS-2 encoding 001B 005B 0031 003B 0032 0020 006D.
This UCS-2 coding uses the ESC Fe coding form for CSI; following Amendment 3 to ISO/IEC 10646-1 the code values 001B 005B can be replaced by the single code value 009B.
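A sketch of how the IUCS control function might be assembled is given below. It takes the Final Byte to be 6D, the value stated in 10.2 as reserved for ISO/IEC 10646-1. The function names `iucs` and `to_ucs2`, and the `use_c1` parameter, are hypothetical, and Python's `utf-16-be` codec stands in for UCS-2.

```python
def iucs(collection_numbers, use_c1: bool = False) -> bytes:
    """Build the ISO/IEC 6429 coded form of IUCS for the given collections."""
    csi = b"\x9B" if use_c1 else b"\x1B\x5B"   # CR-area or ESC Fe form of CSI
    # Decimal digits are coded as 30-39; parameters are separated by 3B.
    parameters = ";".join(str(n) for n in collection_numbers).encode("ascii")
    return csi + parameters + b"\x20\x6D"      # Intermediate Byte, Final Byte


def to_ucs2(octets: bytes) -> bytes:
    """Transform an 8-bit coded form to UCS-2 (byte value becomes the C-octet)."""
    return octets.decode("latin-1").encode("utf-16-be")


print(to_ucs2(iucs([1, 2])).hex().upper())
print(to_ucs2(iucs([1, 2], use_c1=True)).hex().upper())
```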
10.5 Invocation of the UCS from an 8-bit code
The character code structure and extension techniques specified in ISO/IEC 2022 include provision for control functions that
- designate and invoke an identified coding system different from that of ISO/IEC 2022, and
- return to the coding system of ISO/IEC 2022 following such an invocation.
The coded representation of such a control function is by means of an escape sequence in which the first Intermediate Byte has value 25. If there is a second Intermediate Byte with value 2F, it signifies either that the new coding system has no means of return to ISO/IEC 2022 coding or that it provides its own means for so doing. If the second Intermediate Byte is either absent or has any other value then it signifies that the new coding system uses the escape sequence
- ESC 25 40
to return to the coding system of ISO/IEC 2022 and that on such return, the state of the coding system (which sets of control and graphic characters are invoked, etc.) is restored to the state at the time that the other coding system was invoked. This is known as the standard return.
When a new coding system is designated and invoked by means of an escape sequence that signifies that the invocation is without standard return, then any return to the coding system of ISO/IEC 2022 is (by 15.4.2 of ISO/IEC 2022) a return to that coding system in an unspecified state. No announcements, designations and invocations of sets of control and graphic characters, or of other features of ISO/IEC 2022, survive from the state in which the new coding system was invoked. This is true even if the new coding system in fact specifies that return shall be by means of the escape sequence ESC 25 40; it is the escape sequence used to invoke the new coding method, not that used to return from it, that determines the state of the ISO/IEC 2022 coding system upon return. This has particular significance for the UCS, for reasons that will be seen below.
The assignment of particular escape sequences of these forms to particular coding methods is subject to registration in accordance with ISO 2375. The escape sequences so allocated are then published in the International Register of Coded Character Sets to be used with Escape Sequences. The following escape sequences have been registered to designate and invoke the coding methods of the UCS. They are given along with their registration numbers:
- ISO-IR 162 allocates ESC 25 2F 40 to designate and invoke UCS-2 coding at implementation level 1 without standard return;
- ISO-IR 163 allocates ESC 25 2F 41 to designate and invoke UCS-4 coding at implementation level 1 without standard return;
- ISO-IR 174 allocates ESC 25 2F 43 to designate and invoke UCS-2 coding at implementation level 2 without standard return;
- ISO-IR 175 allocates ESC 25 2F 44 to designate and invoke UCS-4 coding at implementation level 2 without standard return;
- ISO-IR 176 allocates ESC 25 2F 45 to designate and invoke UCS-2 coding at implementation level 3 without standard return;
- ISO-IR 177 allocates ESC 25 2F 46 to designate and invoke UCS-4 coding at implementation level 3 without standard return;
- ISO-IR 178 allocates ESC 25 42 to designate and invoke UTF-1 coding (now withdrawn from ISO/IEC 10646-1 by Amendment 4) with standard return – the register entry includes the complete specification of UTF-1;
- ISO-IR 190 allocates ESC 25 2F 47 to designate and invoke UTF-8 coding at implementation level 1 without standard return;
- ISO-IR 191 allocates ESC 25 2F 48 to designate and invoke UTF-8 coding at implementation level 2 without standard return;
- ISO-IR 192 allocates ESC 25 2F 49 to designate and invoke UTF-8 coding at implementation level 3 without standard return;
- ISO-IR 193 allocates ESC 25 2F 4A to designate and invoke UTF-16 coding at implementation level 1 without standard return;
- ISO-IR 194 allocates ESC 25 2F 4B to designate and invoke UTF-16 coding at implementation level 2 without standard return;
- ISO-IR 195 allocates ESC 25 2F 4C to designate and invoke UTF-16 coding at implementation level 3 without standard return;
- ISO-IR 196 allocates ESC 25 47 to designate and invoke UTF-8 coding at an unspecified implementation level but with standard return.
Note that of the coding methods defined for the UCS, only UTF-8 and (the now withdrawn) UTF-1 have the ability to make use of the standard return, since they are the only coding methods in which the escape sequence for the standard return is not transformed when used with the coding method concerned.
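The registrations listed above lend themselves to a simple lookup table. The sketch below transcribes them as data; the names `UCS_INVOCATIONS`, `lookup` and `has_standard_return` are illustrative only. An implementation level of `None` marks the two entries for which the list gives no level.

```python
# Escape sequence -> (ISO-IR registration number, coding method, level)
UCS_INVOCATIONS = {
    "1B 25 2F 40": (162, "UCS-2", 1),
    "1B 25 2F 41": (163, "UCS-4", 1),
    "1B 25 2F 43": (174, "UCS-2", 2),
    "1B 25 2F 44": (175, "UCS-4", 2),
    "1B 25 2F 45": (176, "UCS-2", 3),
    "1B 25 2F 46": (177, "UCS-4", 3),
    "1B 25 42": (178, "UTF-1", None),
    "1B 25 2F 47": (190, "UTF-8", 1),
    "1B 25 2F 48": (191, "UTF-8", 2),
    "1B 25 2F 49": (192, "UTF-8", 3),
    "1B 25 2F 4A": (193, "UTF-16", 1),
    "1B 25 2F 4B": (194, "UTF-16", 2),
    "1B 25 2F 4C": (195, "UTF-16", 3),
    "1B 25 47": (196, "UTF-8", None),
}


def lookup(sequence: bytes):
    """Return the registration for an invoking escape sequence, or None."""
    key = " ".join(format(b, "02X") for b in sequence)
    return UCS_INVOCATIONS.get(key)


def has_standard_return(sequence: bytes) -> bool:
    """True unless the second Intermediate Byte is 2F."""
    return (len(sequence) >= 3 and sequence[0] == 0x1B
            and sequence[1] == 0x25 and sequence[2] != 0x2F)
```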
ISO/IEC 10646-1 does specify a means to return to the ISO/IEC 2022 coding system when it has been invoked without standard return. The specified method is by means of the escape sequence ESC 25 40 transformed to the appropriate coding method of the UCS as described above, e.g. 001B 0025 0040 designates a return from UCS-2. This is the same escape sequence as that used for a standard return, but except when used with UTF-8 its coded representation will include additional padding octets with value zero. Even when used with UTF-8, when that coding method has been invoked by means of ISO-IR 190, 191 or 192, it is interpreted for the purposes of ISO/IEC 2022 as a non-standard return, with consequent loss of the state of that coding system as described above.
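The effect of transforming the return sequence ESC 25 40 into each coding method can be checked with a short sketch. As an assumption for illustration, Python's big-endian codecs stand in for the UCS coding methods, and the function name `return_sequence` is hypothetical.

```python
STANDARD_RETURN = bytes([0x1B, 0x25, 0x40])   # ESC 25 40 in 8-bit form

# Python codec names standing in for the UCS coding methods (an assumption).
CODECS = {
    "UCS-2": "utf-16-be",
    "UTF-16": "utf-16-be",
    "UCS-4": "utf-32-be",
    "UTF-8": "utf-8",
}


def return_sequence(method: str) -> bytes:
    """Coded form of ESC 25 40 within the given UCS coding method."""
    return STANDARD_RETURN.decode("latin-1").encode(CODECS[method])


for method in CODECS:
    coded = return_sequence(method)
    unchanged = coded == STANDARD_RETURN
    print(method, coded.hex().upper(), "unchanged" if unchanged else "padded with zero octets")
```

Of the four methods shown, only UTF-8 leaves the three octets as they are, which is why it alone can make use of the standard return.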