'cmap' table
The 'cmap' table maps character codes to glyph indices. The choice of encoding for a particular font depends on the conventions used by the intended platform. A font intended to run on multiple platforms with different encoding conventions will require multiple encoding tables. As a result, the 'cmap' table may contain multiple subtables, one for each supported encoding scheme.
Character codes that do not correspond to any glyph in the font should be mapped to glyph index 0. At this location in the font there must be a special glyph representing a missing character, typically a box. No character code should be mapped to glyph index -1 (0xFFFF when read as an unsigned 16-bit value), which is a special value reserved in processing to indicate the position of a glyph deleted from the glyph stream.
The 'cmap' table begins with an index containing the table version number followed by the number of encoding tables. The encoding subtables follow.
The original definition of the 'cmap' table only allowed for mappings from traditional character set standards, which used eight, a mixture of eight and sixteen, or sixteen bits for each character. With the introduction of ISO/IEC 10646-1 and the use of surrogates in versions of Unicode from 2.0 onwards, fonts may require references to data that uses a mixture of sixteen and thirty-two bits, or thirty-two bits, per character.
It was originally suggested that a version number of 0 be used to indicate that only encoding subtables of types 0 through 6 are present in the 'cmap' table. If the 'cmap' table contained encoding subtables of types 8.0 or higher, the version number would then be set to 1. These latter encoding subtable types were introduced to provide better support for Unicode text encoded using surrogates.
This suggestion has since been dropped. All 'cmap' tables should set the version number to 0.
Table 6: The 'cmap' index
Type | Name | Description |
---|---|---|
UInt16 | version | Version number (Set to zero) |
UInt16 | numberSubtables | Number of encoding subtables |
'cmap' encoding subtables
Each 'cmap' encoding subtable begins with a platformID, which specifies the environment in which the encoding will be used. The platformSpecificID follows. This identifies the particular encoding chosen among the possible alternatives for the specified platform. For example, MacRoman is one of several possible Mac OS standard encoding schemes. A list of standard platform identifiers and platform-specific identifiers can be found in the section on the 'name' table. The third entry is the offset of the actual mapping table.
Table 7: 'cmap' encoding subtable
Type | Name | Description |
---|---|---|
UInt16 | platformID | Platform identifier |
UInt16 | platformSpecificID | Platform-specific encoding identifier |
UInt32 | offset | Offset of the mapping table |
The 'cmap' encoding subtables must be sorted in ascending order, first by platform identifier and then by platform-specific encoding identifier.
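The index and encoding subtable records above can be walked with a few lines of C. The following is a minimal sketch, not a reference implementation: the names `read_u16`, `read_u32` and `cmap_find_subtable` are hypothetical, the data is assumed to be big-endian (as TrueType data is), and the returned offset is assumed to be measured from the start of the 'cmap' table.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian readers (multi-byte values are stored high byte first). */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: returns the offset stored for the requested
   platformID / platformSpecificID pair, or 0 if no record matches. */
uint32_t cmap_find_subtable(const uint8_t *cmap, size_t size,
                            uint16_t platformID, uint16_t platformSpecificID)
{
    if (size < 4)
        return 0;
    uint16_t numberSubtables = read_u16(cmap + 2);
    if (size < 4 + (size_t)numberSubtables * 8)
        return 0;   /* index claims more records than the data holds */
    for (uint16_t i = 0; i < numberSubtables; i++) {
        const uint8_t *rec = cmap + 4 + (size_t)i * 8;   /* 8-byte record */
        if (read_u16(rec) == platformID &&
            read_u16(rec + 2) == platformSpecificID)
            return read_u32(rec + 4);
    }
    return 0;
}
```

A linear scan is used for clarity; because the records are sorted, a binary search would also work.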
Each 'cmap' subtable is in one of seven currently available formats. These are format 0, format 2, format 4, format 6, format 8.0, format 10.0, and format 12.0, described in the next section.
'cmap' formats
The Macintosh standard character-to-glyph mapping is supported by format 0. Format 2 supports a mixed 8/16-bit mapping useful for Japanese, Chinese and Korean. Format 4 is used for 16-bit mappings. Format 6 is used for dense 16-bit mappings.
Formats 8, 10, and 12 (properly 8.0, 10.0, and 12.0) are used for mixed 16/32-bit and pure 32-bit mappings. This supports text encoded with surrogates in Unicode 2.0 and later.
'cmap' format 0
Format 0 is suitable for fonts whose character codes and glyph indices are restricted to a single byte. It is the standard Apple character-to-glyph-index mapping table.
Table 8: 'cmap' format 0
Type | Name | Description |
---|---|---|
UInt16 | format | Set to 0 |
UInt16 | length | Length in bytes of the subtable (set to 262 for format 0) |
UInt16 | language | Language code for this encoding subtable, or zero if language-independent |
UInt8 | glyphIndexArray[256] | An array that maps character codes to glyph index values |
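A format 0 lookup is a single array access. The sketch below assumes `sub` points at the start of a complete 262-byte format 0 subtable; the function name `cmap0_lookup` is hypothetical.

```c
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* big-endian */
}

/* Returns the glyph index for a one-byte character code, or 0 (the
   missing character glyph) if the code does not fit in one byte or the
   subtable is not format 0. */
uint16_t cmap0_lookup(const uint8_t *sub, uint32_t code)
{
    if (read_u16(sub) != 0 || code > 255)
        return 0;
    return sub[6 + code];   /* glyphIndexArray follows the 6-byte header */
}
```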
'cmap' format 2
The format 2 mapping subtable type is used for fonts containing Japanese, Chinese, or Korean characters. The code standards used in this table are supported on Macintosh systems in Asia. These fonts contain a mixed 8/16-bit encoding, in which certain byte values are set aside to signal the first byte of a 2-byte character. These special values are also legal as the second byte of a 2-byte character.
Table 9 shows the format of a format 2 encoding subtable. The subHeaderKeys array maps each possible high byte into a particular member of the subHeaders array. This allows the determination of whether or not a second byte is used. In either case, the path leads into the glyphIndexArray, from which the mapped glyph index is obtained. The sequence of operations is as follows:
Consider a high byte, i, designating an integer between 0 and 255. The value subHeaderKeys[i], divided by 8, is the index k into the subHeaders array. The value k = 0 is special: it means that i is a one-byte code and no second byte will be referenced. If k is positive, then i is the high byte of a two-byte code and its second byte j will be consumed.
Table 9: 'cmap' format 2
Type | Name | Description |
---|---|---|
UInt16 | format | Set to 2 |
UInt16 | length | Total table length in bytes |
UInt16 | language | Language code for this encoding subtable, or zero if language-independent |
UInt16 | subHeaderKeys[256] | Array that maps high bytes to subHeaders: value is index * 8 |
UInt16 * 4 | subHeaders[variable] | Variable length array of subHeader structures |
UInt16 | glyphIndexArray[variable] | Variable length array containing subarrays |
The subHeader data type is a 4-word structure defined by the C-language structure shown below:
typedef struct {
UInt16 firstCode;
UInt16 entryCount;
int16 idDelta;
UInt16 idRangeOffset;
} subheader;
If k is positive, then the four values belonging to subHeaders[k] are used as follows, with firstCode and entryCount defining the allowable range for the second byte j:
firstCode <= j < (firstCode + entryCount)
If j is outside this range, index 0 (the missing character glyph) is returned. Otherwise, idRangeOffset is used to identify the associated range within the glyphIndexArray. The glyphIndexArray immediately follows the subHeaders array and may be loosely viewed as an extension to it. The value of the idRangeOffset is the number of bytes past the actual location of the idRangeOffset word where the glyphIndexArray element corresponding to firstCode appears. Let p be the glyphIndexArray element for j, which lies (j - firstCode) words past the element for firstCode. If p is zero, it is returned directly. If p is nonzero, p = p + idDelta is returned. The sum is reduced modulo 65536, if necessary.
For the one-byte case with k = 0, the structure subHeaders[0] will show firstCode = 0, entryCount = 256, and idDelta = 0. The idRangeOffset will point, as previously discussed, to the beginning of the glyphIndexArray. Indexing i words into this array gives the returned value p = glyphIndexArray[i].
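The format 2 procedure can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: `cmap2_lookup` and its `bytesUsed` output are hypothetical, the subtable is assumed to be well formed (no bounds checks against `length`), and the byte offsets 6 and 518 follow from the 6-byte header and the 512-byte subHeaderKeys array in Table 9.

```c
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* big-endian */
}

/* i is the first byte of the character, j the byte that follows it (ignored
   for one-byte codes). *bytesUsed receives 1 or 2. Returns the glyph index,
   or 0 for the missing character glyph. */
uint16_t cmap2_lookup(const uint8_t *sub, uint8_t i, uint8_t j, int *bytesUsed)
{
    uint16_t k = (uint16_t)(read_u16(sub + 6 + 2 * i) / 8);
    const uint8_t *sh = sub + 518 + 8 * k;      /* subHeaders[k] */
    uint16_t firstCode     = read_u16(sh);
    uint16_t entryCount    = read_u16(sh + 2);
    uint16_t idDelta       = read_u16(sh + 4);  /* used modulo 65536 */
    uint16_t idRangeOffset = read_u16(sh + 6);

    /* For k == 0 the single byte i itself indexes the subarray. */
    uint8_t low = (k == 0) ? i : j;
    *bytesUsed = (k == 0) ? 1 : 2;

    if (low < firstCode || low >= firstCode + entryCount)
        return 0;

    /* idRangeOffset counts bytes from its own location (sh + 6) to the
       glyphIndexArray element for firstCode. */
    uint16_t p = read_u16(sh + 6 + idRangeOffset + 2 * (low - firstCode));
    if (p == 0)
        return 0;
    return (uint16_t)(p + idDelta);
}
```

Returning the number of bytes consumed lets a caller step through mixed 8/16-bit text without separate knowledge of the encoding.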
'cmap' format 4
Format 4 is a two-byte encoding format. It should be used when the character codes for a font fall into several contiguous ranges, possibly with holes in some or all of the ranges. That is, some of the codes in a range may not be associated with glyphs in the font. Two-byte fonts that are densely mapped should use Format 6.
The table begins with the format number, the length and language. The format-dependent data follows. It is divided into three parts: a four-word header of search parameters (segCountX2, searchRange, entrySelector, rangeShift), four parallel arrays describing the segments (endCode, startCode, idDelta, idRangeOffset), and a variable-length glyphIndexArray.
Table 10: Format 4
Type | Name | Description |
---|---|---|
UInt16 | format | Format number is set to 4 |
UInt16 | length | Length of subtable in bytes |
UInt16 | language | Language code for this encoding subtable, or zero if language-independent |
UInt16 | segCountX2 | 2 * segCount |
UInt16 | searchRange | 2 * (2**FLOOR(log2(segCount))) |
UInt16 | entrySelector | log2(searchRange/2) |
UInt16 | rangeShift | (2 * segCount) - searchRange |
UInt16 | endCode[segCount] | Ending character code for each segment, last = 0xFFFF. |
UInt16 | reservedPad | This value should be zero |
UInt16 | startCode[segCount] | Starting character code for each segment |
UInt16 | idDelta[segCount] | Delta for all character codes in segment |
UInt16 | idRangeOffset[segCount] | Offset in bytes to glyphIndexArray, or 0 |
UInt16 | glyphIndexArray[variable] | Glyph index array |
The number of segments is specified by the variable segCount. This variable is not stored explicitly in the Format 4 table (only segCountX2 is); however, it is the number from which all of the table parameters are derived. The segCount is the number of contiguous code ranges in the font. The searchRange value is twice the largest power of 2 that is less than or equal to segCount.
Example Format 4 subtable values are shown in this table:
Name | Value | Calculation |
---|---|---|
segCount | 39 | Not calculated; determined from the organization of the glyph indices |
searchRange | 64 | (2 * (largest power of 2 <= 39)) = 2 * 32 |
entrySelector | 5 | log2(searchRange/2) = log2(32) |
rangeShift | 14 | (2 * segCount) - searchRange = (2 * 39) - 64 |
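The derived header fields can be computed directly from segCount. The sketch below uses a hypothetical struct `Cmap4SearchParams` and function `cmap4_search_params`, and assumes 1 <= segCount <= 32767 so that every value fits in 16 bits.

```c
#include <stdint.h>

/* Hypothetical container for the four derived header words. */
typedef struct {
    uint16_t segCountX2;
    uint16_t searchRange;
    uint16_t entrySelector;
    uint16_t rangeShift;
} Cmap4SearchParams;

Cmap4SearchParams cmap4_search_params(uint16_t segCount)
{
    Cmap4SearchParams p;
    uint16_t power = 1;      /* largest power of 2 <= segCount */
    uint16_t exponent = 0;   /* log2(power) */

    while ((uint32_t)power * 2 <= segCount) {
        power = (uint16_t)(power * 2);
        exponent++;
    }
    p.segCountX2    = (uint16_t)(2 * segCount);
    p.searchRange   = (uint16_t)(2 * power);
    p.entrySelector = exponent;
    p.rangeShift    = (uint16_t)(p.segCountX2 - p.searchRange);
    return p;
}
```

For segCount = 39 this yields searchRange 64, entrySelector 5 and rangeShift 14, matching the example values.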
Each segment is described by a startCode, an endCode, an idDelta and an idRangeOffset. These are used for mapping the character codes in the segment. The segments are sorted in order of increasing endCode values.
To use these arrays, it is necessary to search for the first endCode that is greater than or equal to the character code to be mapped. If the corresponding startCode is less than or equal to the character code, then use the corresponding idDelta and idRangeOffset to map the character code to the glyph index. Otherwise, the missing character glyph is returned. To ensure that the search will terminate, the final endCode value must be 0xFFFF. This segment need not contain any valid mappings. It can simply map the single character code 0xFFFF to the missing character glyph, glyph 0.
If the idRangeOffset value for the segment is not 0, the mapping of the character codes relies on the glyphIndexArray. The character code offset from startCode is added to the idRangeOffset value. This sum is used as an offset from the current location within idRangeOffset itself to index out the correct glyphIndexArray value. This indexing method works because glyphIndexArray immediately follows idRangeOffset in the font file. The glyph index is given by the following expression, which treats idRangeOffset as an array of UInt16 values:
glyphIndex = *( &idRangeOffset[i] + idRangeOffset[i]/2 + (c - startCode[i]) )
Division by 2 in this expression converts idRangeOffset[i], which is a byte count, into a count of 16-bit words. If the value fetched is not 0 (the missing glyph), idDelta[i] is added to it to obtain the final glyph index. If the idRangeOffset is 0, the idDelta value is added directly to the character code to get the corresponding glyph index:
glyphIndex = idDelta[i] + c
NOTE: All idDelta[i] arithmetic is modulo 65536.
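A minimal sketch of the format 4 lookup follows. The function name `cmap4_lookup` is hypothetical, the subtable is assumed to be well formed, and a linear scan is used for clarity where a real implementation would typically use the searchRange, entrySelector and rangeShift fields for a binary search. It also assumes the usual TrueType convention that a non-zero value fetched from the glyphIndexArray has idDelta added to it, modulo 65536.

```c
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* big-endian */
}

/* Maps a 16-bit character code c to a glyph index; 0 is the missing glyph. */
uint16_t cmap4_lookup(const uint8_t *sub, uint16_t c)
{
    uint16_t segCountX2 = read_u16(sub + 6);
    const uint8_t *endCode       = sub + 14;
    const uint8_t *startCode     = endCode + segCountX2 + 2;  /* skip reservedPad */
    const uint8_t *idDelta       = startCode + segCountX2;
    const uint8_t *idRangeOffset = idDelta + segCountX2;

    for (uint16_t off = 0; off < segCountX2; off += 2) {
        if (read_u16(endCode + off) < c)
            continue;                         /* not yet the first endCode >= c */
        uint16_t start = read_u16(startCode + off);
        if (start > c)
            return 0;                         /* c falls in a gap between segments */
        uint16_t delta       = read_u16(idDelta + off);
        uint16_t rangeOffset = read_u16(idRangeOffset + off);
        if (rangeOffset == 0)
            return (uint16_t)(c + delta);     /* modulo 65536 */
        /* Byte offset measured from this segment's idRangeOffset word. */
        uint16_t g = read_u16(idRangeOffset + off + rangeOffset + 2 * (c - start));
        return g ? (uint16_t)(g + delta) : 0;
    }
    return 0;
}
```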
The following table gives an example of the parameters required to map characters 10-20, 30-90, and 100-153 to a contiguous range of glyph indices, and demonstrates how the character-to-glyph index mapping values are calculated. For this example segCount = 4, so segCountX2 is 8, searchRange is 8, entrySelector is 2, and rangeShift is 0.
Name | Segment 1 Chars 10-20 | Segment 2 Chars 30-90 | Segment 3 Chars 100-153 | Segment 4 Missing Glyph |
---|---|---|---|---|
endCode | 20 | 90 | 153 | 0xFFFF |
startCode | 10 | 30 | 100 | 0xFFFF |
idDelta | -9 | -18 | -27 | 1 |
idRangeOffset | 0 | 0 | 0 | 0 |
This table performs the following mappings:
10 is mapped to 10-9 or 1
20 is mapped to 20-9 or 11
30 is mapped to 30-18 or 12
90 is mapped to 90-18 or 72
and so on.
'cmap' format 6
Format 6 is used to map 16-bit (2-byte) characters to glyph indexes. It is sometimes called the trimmed table mapping. It should be used when character codes for a font fall into a single contiguous range. This results in what is termed a dense mapping.
Table 11: 'cmap' format 6
Type | Name | Description |
---|---|---|
UInt16 | format | Format number is set to 6 |
UInt16 | length | Length in bytes |
UInt16 | language | Language code for this encoding subtable, or zero if language-independent |
UInt16 | firstCode | First character code of subrange |
UInt16 | entryCount | Number of character codes in subrange |
UInt16 | glyphIndexArray[entryCount] | Array of glyph index values for character codes in the range |
The firstCode and entryCount values in the subtable specify the useful subrange within the range of possible character codes. The range begins with firstCode and has a length equal to entryCount. Codes outside of this subrange are assumed to be missing and are mapped to the glyph with index 0. For a code within the subrange, its offset from the firstCode in the subrange is used as an index into the glyphIndexArray. That array provides the glyph index associated with that character code.
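The trimmed-table lookup reduces to a range check and one array read. In this sketch the name `cmap6_lookup` is hypothetical and `sub` is assumed to point at a well-formed format 6 subtable.

```c
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* big-endian */
}

/* Returns the glyph index for code, or 0 if code lies outside
   [firstCode, firstCode + entryCount). */
uint16_t cmap6_lookup(const uint8_t *sub, uint32_t code)
{
    uint16_t firstCode  = read_u16(sub + 6);
    uint16_t entryCount = read_u16(sub + 8);

    if (code < firstCode || code - firstCode >= entryCount)
        return 0;
    return read_u16(sub + 10 + 2 * (code - firstCode));
}
```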
'cmap' format 8.0
Mixed 16-bit and 32-bit coverage
Format 8.0 is a bit like format 2, in that it provides for mixed-length character codes. If a font contains Unicode surrogates, it is likely to include regular 16-bit Unicode values as well. This requires a format to map a mixture of 16-bit and 32-bit character codes, just as format 2 allows a mixture of 8-bit and 16-bit codes. A simplifying assumption is made: namely, that there are no 32-bit character codes which share the same first 16 bits as any 16-bit character code. This means that the determination as to whether a particular 16-bit value is a standalone character code or the start of a 32-bit character code can be made by looking at the 16-bit value directly, with no further information required.
Here's the format 8 subtable format:
Type | Name | Description |
---|---|---|
Fixed32 | format | Subtable format; set to 8.0 |
UInt32 | length | Byte length of this subtable (including the header) |
UInt32 | language | Language code for this encoding subtable, or zero if language-independent |
UInt8 | is32[8192] | Tightly packed array of 65536 bits (8K bytes total) indicating whether the particular 16-bit (index) value is the start of a 32-bit character code |
UInt32 | nGroups | Number of groupings which follow |
Here follow the individual groups. Each group has the following format:
Type | Name | Description |
---|---|---|
UInt32 | startCharCode | First character code in this group; note that if this group is for one or more 16-bit character codes (which is determined from the is32 array), this 32-bit value will have the high 16 bits set to zero |
UInt32 | endCharCode | Last character code in this group; same condition as listed above for the startCharCode |
UInt32 | startGlyphCode | Glyph index corresponding to the starting character code |
A few notes here. The endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode there explicitly saves an addition per group.
The presence of the packed array of bits indicating whether a particular 16-bit value is the start of a 32-bit character code is useful even when the font contains no glyphs for a particular 16-bit start value. This is because the system software often needs to know how many bytes ahead the next character begins, even if the current character maps to the missing glyph. By including this information explicitly in this table, no "secret" knowledge needs to be encoded into the OS.
Thus, although cmap format 8.0 is well-suited for Unicode text encoded using surrogates, it also has the flexibility to be used with other character set encodings.
To determine if a particular word (cp) is the first half of a thirty-two bit code point, one can use an expression such as ( is32[ cp / 8 ] & ( 1 << ( cp % 8 ) ) ). If this is non-zero, then the word is the first half of a thirty-two bit code point.
0 is not a special value for the high word of a 32-bit code point. A font may not have both a glyph for the code point 0x0000 and glyphs for code points with a high word of 0x0000.
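The is32 test above can be wrapped in a small function and used to step through 16-bit text. This is a sketch only: `cmap8_is32` and `cmap8_next_code` are hypothetical names, `is32` is assumed to point at the packed bit array, and a 32-bit code is assumed to be the first word in the high 16 bits followed by the second word in the low 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-zero if the 16-bit word cp starts a 32-bit character code. */
int cmap8_is32(const uint8_t *is32, uint16_t cp)
{
    return (is32[cp / 8] & (1 << (cp % 8))) != 0;
}

/* Reads one character code from text (count words available).
   Returns the number of words consumed: 1 or 2, or 0 if the text is
   empty or ends in the middle of a 32-bit code. */
size_t cmap8_next_code(const uint8_t *is32, const uint16_t *text,
                       size_t count, uint32_t *code)
{
    if (count == 0)
        return 0;
    if (!cmap8_is32(is32, text[0])) {
        *code = text[0];                       /* standalone 16-bit code */
        return 1;
    }
    if (count < 2)
        return 0;                              /* truncated 32-bit code */
    *code = ((uint32_t)text[0] << 16) | text[1];
    return 2;
}
```

The resulting code can then be matched against the groups by comparing it with startCharCode and endCharCode.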
'cmap' format 10.0
Trimmed array
Format 10.0 is a bit like format 6, in that it defines a trimmed array for a tight range of 32-bit character codes:
Type | Name | Description |
---|---|---|
Fixed32 | format | Subtable format; set to 10.0 |
UInt32 | length | Byte length of this subtable (including the header) |
UInt32 | language | Language code for this encoding subtable, or zero if language-independent |
UInt32 | startCharCode | First character code covered |
UInt32 | numChars | Number of character codes covered |
UInt16 | glyphs[] | Array of glyph indices for the character codes covered |
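A format 10.0 lookup mirrors format 6 with 32-bit fields. The name `cmap10_lookup` is hypothetical and the subtable is assumed to be well formed; the offsets 12, 16 and 20 follow from the field sizes in the table above.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);   /* big-endian */
}

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns the glyph index for code, or 0 if it is outside the covered range. */
uint16_t cmap10_lookup(const uint8_t *sub, uint32_t code)
{
    uint32_t startCharCode = read_u32(sub + 12);
    uint32_t numChars      = read_u32(sub + 16);

    if (code < startCharCode || code - startCharCode >= numChars)
        return 0;
    return read_u16(sub + 20 + 2 * (size_t)(code - startCharCode));
}
```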
'cmap' format 12.0
Segmented coverage
Format 12.0 is a bit like format 4, in that it defines segments for sparse representation in 4-byte character space. Here's the subtable format:
Type | Name | Description |
---|---|---|
Fixed32 | format | Subtable format; set to 12.0 |
UInt32 | length | Byte length of this subtable (including the header) |
UInt32 | language | Language code for this encoding subtable, or zero if language-independent |
UInt32 | nGroups | Number of groupings which follow |
Here follow the individual groups, each of which has the following format:
Type | Name | Description |
---|---|---|
UInt32 | startCharCode | First character code in this group |
UInt32 | endCharCode | Last character code in this group |
UInt32 | startGlyphCode | Glyph index corresponding to the starting character code |
Again, the endCharCode is used, rather than a count, because comparisons for group matching are usually done on an existing character code, and having the endCharCode there explicitly saves an addition per group.
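Group matching for format 12.0 can be sketched as a binary search. Two assumptions are made here that the text above does not state outright: the groups are sorted by increasing startCharCode and do not overlap, and consecutive character codes within a group map to consecutive glyph indices starting at startGlyphCode. The name `cmap12_lookup` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];   /* big-endian */
}

/* Returns the glyph index for code, or 0 if no group covers it. */
uint32_t cmap12_lookup(const uint8_t *sub, uint32_t code)
{
    uint32_t lo = 0;
    uint32_t hi = read_u32(sub + 12);                    /* nGroups */

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        const uint8_t *g = sub + 16 + (size_t)mid * 12;  /* 12-byte group */
        uint32_t startCharCode = read_u32(g);
        uint32_t endCharCode   = read_u32(g + 4);

        if (code < startCharCode)
            hi = mid;
        else if (code > endCharCode)
            lo = mid + 1;
        else
            return read_u32(g + 8) + (code - startCharCode);
    }
    return 0;
}
```

Because endCharCode is stored explicitly, each step needs only two comparisons and no addition, which is the saving the note above describes.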
All cmap subtable formats are supported on Mac OS X 10.2. The Mac OS does not require specific formats for any particular cmap subtable.
Newton fonts use the older, format 0, 2, 4, and 6 encoding subtables only. Formats 8.0, 10.0, and 12.0 are not supported.
The 'cmap' table references glyph indices. As such, the glyph indices must be valid for the particular font and cannot exceed the number of glyphs, which is found in the maximum profile table.
The main tool for editing 'cmap' tables is ftxdumperfuser. Note that ftxdumperfuser supports all seven 'cmap' subtable formats and supports supplementary Unicode characters using their Unicode scalar values.
The original architecture of the Unicode Standard allowed for all encoded characters to be represented using sixteen-bit code points. This allowed for up to 65,534 characters to be encoded. (Unicode code points U+FFFE and U+FFFF are reserved and unavailable to represent characters. For more details, see The Unicode Standard.) As such, Unicode differed from other character set encodings, some of which represent all characters with eight bits, and others of which have some characters eight bits in size and others sixteen.
During the course of development of version 2.0 of Unicode, it became clear that this would not provide sufficient code points to cover the entire repertoire of required characters. To solve the problem, an extension mechanism was adopted which involved surrogates. These are special Unicode code points which come in pairs, a high surrogate (U+D800 through U+DBFF) and a low surrogate (U+DC00 through U+DFFF). An algorithm is defined to map properly paired surrogates to a single 32-bit entity called a scalar value, which represents a single character.
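The mapping from a surrogate pair to a scalar value is the standard UTF-16 decoding formula, sketched below. The function name `surrogates_to_scalar` is hypothetical.

```c
#include <stdint.h>

/* Combines a high surrogate (U+D800..U+DBFF) and a low surrogate
   (U+DC00..U+DFFF) into a scalar value. Returns 1 on success, or 0 if
   the two words are not a properly paired surrogate sequence. */
int surrogates_to_scalar(uint16_t high, uint16_t low, uint32_t *scalar)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0;
    *scalar = 0x10000u
            + (((uint32_t)high - 0xD800u) << 10)
            + ((uint32_t)low - 0xDC00u);
    return 1;
}
```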
Unicode 2.0 and 3.0 do not actually encode any characters using surrogates, but Unicode 3.1, published in March 2001, includes over 40,000 characters that require surrogates. Future versions of the Unicode standard will include still more characters encoded using surrogates.
Unicode text encoded using sixteen-bit code points and surrogates is referred to as UTF-16. The cmap format 8.0 is appropriate to use for UTF-16 text. Note that in this case, 0x0000 is always a code point in its own right and never the first half of a two-word sequence.
The Unicode Technical Committee has adopted a 32-bit form of Unicode text whereby every character is represented by a single 32-bit code. This is referred to as UTF-32. Cmap formats 10.0 and 12.0 are appropriate for UTF-32 text.
There is also an eight-bit representation of Unicode text, referred to as UTF-8. UTF-8 is frequently used in exchange protocols that assume C-like strings, where a zero byte is used as a string terminator (along with other single bytes with special interpretations). There are no cmap formats defined appropriate for use with UTF-8 text.
'cmap'
tables containing format 8.0, 10.0, or 12.0 data. Changed references to DumpCMAP and FuseCMAP to DumperFuser. Updated information on Unicode 3.1 publication. Fixed some typos.
Last updated: JHJ