Characters (GCGID) | Code page information

Disclaimers

These observations were taken from analysing the GCGIDs used in published code pages and repertoires, plus the limited information available in this former section of the IBM website (link is to Wayback Machine). They are provided here for informational purposes only.

Over-arching GCGID structure

A GCGID (Graphic Character Global Identifier) consists of eight characters, each of which must be either a capital ASCII letter or an ASCII digit. The first character of a GCGID is always a letter, and defines the overarching category of the GCGID; the others vary according to category.

For most GCGIDs, the basic identity of the character is defined by the first four characters alone, with the remaining four setting attributes such as fullwidth, superscript, or variant forms. The exceptions are those beginning with IX, EX, FX, TX or U, which do not follow that structure. Furthermore, most types of GCGID (all except those starting with E, F, I, T, X or U) require the first two characters to be letters and the next two to be digits.
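The constraints above can be checked mechanically. The following is a minimal sketch, not an official validator: the function name `is_wellformed_gcgid` is made up for illustration, and it encodes only the rules stated in this section (eight characters, capital ASCII letters or digits, leading letter, and the letter-letter-digit-digit prefix outside the E, F, I, T, X and U series).

```python
import string

# Characters permitted anywhere in a GCGID: capital ASCII letters and ASCII digits.
_ALLOWED = set(string.ascii_uppercase + string.digits)

# Series letters exempt from the letter-letter-digit-digit prefix rule.
_EXEMPT_FIRST = set("EFITXU")


def is_wellformed_gcgid(gcgid: str) -> bool:
    """Check a string against the structural rules described in the text."""
    if len(gcgid) != 8:
        return False
    if any(ch not in _ALLOWED for ch in gcgid):
        return False
    if gcgid[0] not in string.ascii_uppercase:
        return False
    if gcgid[0] in _EXEMPT_FIRST:
        # These series have their own layouts, so no further prefix check here.
        return True
    return (
        gcgid[1] in string.ascii_uppercase
        and gcgid[2] in string.digits
        and gcgid[3] in string.digits
    )
```

For instance, `is_wellformed_gcgid("LD640000")` passes, while a lowercase or seven-character string does not. Passing this check says nothing about whether the GCGID is actually assigned.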

Orthodox structure

For the categories that fully follow the GCGID structure rather than deviating to a greater or lesser extent, the positions break down as follows. The second half is applicable to more categories than the first.

1. ASCII letter: the general category of the character, as noted above and elaborated below.
2. ASCII letter, more specific. Under N or S, this is a subcategory. Elsewhere, this is usually a Basic Latin base letter which could (with or without certain modifiers) correspond to the character; for kana, this is the consonant if applicable or the vowel otherwise. If the character itself is inherently a modifier and/or transliterates to a modifier, this is X; kana also uses Q for punctuation.
3. An ASCII digit.
4. A second ASCII digit. These two digits have the following meanings:
   - For Latin letters, capitals are always even, bare lowercase is 01, bare uppercase is 02 and versions modified with diacritics are 11 or higher. Furthermore, Hebrew, Greek, Cyrillic and to a lesser extent Arabic letters are generally coördinated with whatever diacritics or modifiers appear in the Latin transcription, though Arabic may use 02–10 for letter modifications not corresponding to a diacritic.
   - For kana, the first digit is 1 through 5 denoting the vowel, 0 if the character is itself a vowel, or 7 if the character is a modifier or punctuation. For syllabograms, the second digit is 1 if small and 0 otherwise; for a modifier, 0 is the vowel extender, 1 the dakuten and 2 the handakuten.
   - For characters under N or S, this is an index within the subcategory. Under N, this does not necessarily correspond to the numerical value, although it sometimes does; this is detailed further below.
5. 0 for full-height spacing, 1 for superscript, 2 for subscript and 8 for combining. Big5-style small horizontal forms are treated the same as the corresponding superscripts for some reason, so the appropriate mapping for GCGIDs with 1 here and 8 in the width field can vary.
6. ASCII digit, size/position variant, usually for a diacritic but sometimes used in semigraphics et cetera.
7. Width attribute: Z for zero-width, H for explicitly halfwidth, 8 for fullwidth. In katakana and jamo, 0 denotes halfwidth in practice; otherwise 0 is used for neutral, natural, proportional or unspecified width. In AFP/FOCA code page resources with names starting with T1H, it is fairly typical for GCGIDs which have 0 here in the regular code page definition to instead have H here.
8. A character, usually an ASCII digit (though it can start using letters if more than ten are needed), identifying a character variant, or a character which might in specific contexts be interchangeable. These may be coördinated between coördinate characters (e.g. Arabic letters) to an extent, but there is no overarching scheme for all or even most characters.
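The eight positions above can be split into named fields. This is an illustrative sketch only: the field names (`category`, `base`, `index`, `vertical`, `size_variant`, `width`, `variant`) and the functions `split_orthodox` and `describe_width` are my own labels rather than IBM terminology, and the code does not check that the category actually follows the orthodox structure.

```python
from typing import NamedTuple


class OrthodoxGcgid(NamedTuple):
    category: str      # position 1: general category letter
    base: str          # position 2: subcategory or base letter
    index: str         # positions 3-4: two digits
    vertical: str      # position 5: spacing / superscript / subscript / combining
    size_variant: str  # position 6: size or position variant
    width: str         # position 7: width attribute
    variant: str       # position 8: character variant


# Meanings of position 5 and position 7 as listed in the text.
VERTICAL = {"0": "full-height spacing", "1": "superscript",
            "2": "subscript", "8": "combining"}
WIDTH = {"Z": "zero-width", "H": "halfwidth", "8": "fullwidth",
         "0": "neutral or unspecified"}  # 0 means halfwidth for katakana and jamo


def split_orthodox(gcgid: str) -> OrthodoxGcgid:
    """Split an eight-character GCGID into the positional fields."""
    if len(gcgid) != 8:
        raise ValueError("a GCGID has exactly eight characters")
    return OrthodoxGcgid(gcgid[0], gcgid[1], gcgid[2:4],
                         gcgid[4], gcgid[5], gcgid[6], gcgid[7])


def describe_width(parts: OrthodoxGcgid) -> str:
    """Return a description of the width attribute, or 'unknown'."""
    return WIDTH.get(parts.width, "unknown")
```

For example, `split_orthodox("LD640000")` yields category `L`, base `D` and index `64`, with every attribute field set to `0`.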

Relationship to Unicode

It is not always true that one GCGID can be considered equivalent to one Unicode code point: CJK characters have GCGIDs assigned tautologically from their language-specific EBCDIC encodings, some distinct GCGIDs are unified by Unicode (Yen sign (SC050000) and Yuan sign (SC120000) for instance), and some GCGIDs encompass multiple others (there are three separate GCGIDs for capital ð (LD640000), capital đ (LD600000), and the unification of the two (LD620000), for example).

There is no one authoritative mapping from GCGIDs to Unicode, only from a given codepage to both GCGIDs and Unicode strictly within the context of that codepage; C-H 3-3220-024 2002-11, C-H 3-3220-126 2016-04, this former section of the IBM website (link is to Wayback Machine), the IBM AFP ECPs, and several ICU codecs all include some IBM-originated Unicode mappings for a subset of GCGIDs, but do not always agree on the Unicode mapping for a given GCGID, and none of them map anything close to all GCGIDs.

Top-level categories of GCGID

Links are provided to lists of names and Unicode mappings for each category. These mappings are not guaranteed to interoperate with anything, and are not stabilised (I may amend them without warning), but are purely illustrative. Since GCGIDs do not map one-to-one with Unicode, there is also no guarantee that they round-trip.

A: Arabic letters (A: Arabic). Since this includes a different collection of Arabic presentation forms to the one Unicode does, I also have a summary of Arabic presentation form GCGIDs without standard Unicode equivalents.
B: Thai letters and Thai diacritics (B: "Brahmic"? T was of course taken by Traditional Chinese, and other Brahmic scripts such as Devanagari are represented solely using GCUIDs in the codepages I've seen.)
D: Bar code segments.
E: Simplified Chinese. Structure as stated below.
F: Korean syllables and Hanja. Structure as stated below.
G: Greek letters (G: Greek)
H: Hebrew letters (H: Hebrew)
I: Japanese Kanji (I: "Ideographic"?). Structure as stated below.
J: Katakana, and general kana modifiers and punctuation (J: "Japanese katakana")
K: Cyrillic letters (K: "Kirilica")
L: Latin letters (L: Latin)
N: Numeric characters (N: Numeral or Numeric)
O: Korean individual letters
R: Hiragana (R: "hiRagana"?)
S: Symbols, punctuation and Zhuyin (S: Symbol or Special)
T: Traditional Chinese (T: "Traditional Chinese"). Structure as stated below.
U: Code point, usually from Unicode, in hexadecimal in the third through eighth characters, while the second character stipulates a glyph variant, similarly to the last character in the orthodox structure (usually 0, while 5 is used for some Traditional Chinese hanzi). This is a special type of GCGID called a GCUID. However, because the code point is constrained by its length to the historic Group 0 (U+0000–FFFFFF) rather than to Unicode proper (U+0000–10FFFF), GCUIDs starting with U0F are sometimes used to represent a character that would either be a combining sequence in Unicode or has other mapping issues, such as a standard mapping a specific glyph to the PUA. The digit after the "F" is either "F" for "Far East" (i.e. JIS X 0213 or GB 18030) or "C" for "Composed" (i.e. anything else).
X: UDC (user-defined EUDC) identified by base36 digits in second, third and fourth chars derived from the EBCDIC in the same manner as the E/F/I/T GCGIDs, as explained below (annoyingly, X GCGIDs are directly listed only in the four "EUC" specs, which for everything except Traditional Chinese only support a subset of the UDCs supported by the EBCDIC Host Codes themselves). I have attempted to put together a single mapping for the X series, although that's not always what is used in practice since the Japanese UDC range engulfs the entirety of the Simplified Chinese one, and the Traditional Chinese one does likewise to the Korean one. An overview of X series GCGID usage is arguably more useful.
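The U series (GCUID) layout described in the list above can be unpacked as follows. This is a rough sketch under the stated layout only (a `U`, one variant character, six hexadecimal digits); the function name `parse_gcuid` is hypothetical, and the sample identifiers used with it are made up for illustration rather than taken from any code page.

```python
_HEX = set("0123456789ABCDEF")


def parse_gcuid(gcuid: str) -> tuple[str, int, bool]:
    """Return (variant character, code point, whether it lies within Unicode)."""
    if len(gcuid) != 8 or gcuid[0] != "U":
        raise ValueError("a GCUID is eight characters starting with U")
    variant, digits = gcuid[1], gcuid[2:]
    if any(ch not in _HEX for ch in digits):
        raise ValueError("characters three to eight must be capital hex digits")
    code = int(digits, 16)
    # Six hex digits reach up to FFFFFF, which is beyond Unicode's 10FFFF.
    # Values at F00000 and above (the U0F... identifiers) are not Unicode.
    return variant, code, code <= 0x10FFFF
```

A result whose last element is `False` marks an identifier that cannot be mapped by simply reading the hexadecimal digits as a Unicode code point.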

GCGID overview table

Structure of E/F/I/T/X GCGIDs

In the specification documents, E/F/I/T/X GCGIDs are code points from a double-byte host code (i.e. EBCDIC Shift Out) transformed/compressed to the second, third and fourth characters. E refers to the Simplified Chinese code (CPGID 837), F to the Korean one (CPGID 834), I to the Japanese one (CPGID 300) and T to the Traditional Chinese one (CPGID 835). The transformation from coding bytes to GCGID in this case is base36(((firstByte - 0x40) * 0xC0) + (secondByte - 0x40)), where Base-36 uses 0–9 then A–Z (this means 0 is contrastive with O). This variant of the E, I and T series is exclusively for hanzi; however, this variant of the F series also includes Korean syllables (more numerous than hanzi), encircled forms, box drawing and other characters, even in some cases those which also have orthodox GCGIDs.
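The transformation just described can be sketched in a few lines. The function names `to_base36` and `dbcs_to_base36_part` are hypothetical, and two details are assumptions of mine rather than facts from the specifications: the result is left-padded with `0` to three characters, and both bytes are restricted to 0x40 through 0xFE. Only the three Base-36 characters are produced; the rest of the GCGID is not reconstructed here.

```python
_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(value: int) -> str:
    """Render a non-negative integer in Base-36 using 0-9 then A-Z."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return "0"
    out = []
    while value:
        value, rem = divmod(value, 36)
        out.append(_DIGITS[rem])
    return "".join(reversed(out))


def dbcs_to_base36_part(first_byte: int, second_byte: int) -> str:
    """Compute characters two to four of an E/F/I/T/X GCGID from host code bytes."""
    for b in (first_byte, second_byte):
        if not 0x40 <= b <= 0xFE:  # assumed valid byte range
            raise ValueError("byte outside the assumed 0x40-0xFE range")
    value = (first_byte - 0x40) * 0xC0 + (second_byte - 0x40)
    return to_base36(value).rjust(3, "0")  # padding to three characters is assumed


print(dbcs_to_base36_part(0xFE, 0xFE))  # SAM
```

The bytes 0xFE 0xFE give (190 × 192) + 190 = 36670, which is `SAM` in Base-36.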

Another variant has ?XA0, where ? is replaced with the series letter, followed by the hexadecimal of the EBCDIC representation (this works since the last valid Base-36 representation is ?SAM for 0xFE 0xFE); rather than using a separate X series, UDC in this variant use ?XU0 under the usual series. This variant of GCGIDs is not limited to hanzi, and appears in ECP code page files as an alternative both to the regular E/F/I/T/X GCGIDs and to GCUIDs; it is used, in fact, for all characters that are represented in double-byte host code with a lead byte of 0x45 or greater, regardless of whether they have other GCGIDs assigned in the respective specifications. IBM seem to refer to this type as "AFP GCGIDs", e.g. in the CPGID 1445 changelog.
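A sketch of how such an "AFP GCGID" might be assembled follows. The function name `afp_gcgid` and its `udc` flag are hypothetical; the sketch assumes the hexadecimal part is the two bytes written as four capital hex digits (which makes the total eight characters), and the sample bytes in the usage note are illustrative rather than taken from a real code page.

```python
def afp_gcgid(series: str, first_byte: int, second_byte: int,
              udc: bool = False) -> str:
    """Build a ?XA0hhhh (or ?XU0hhhh for UDC) identifier from host code bytes."""
    if series not in ("E", "F", "I", "T"):
        raise ValueError("series must be one of E, F, I or T")
    if not (0 <= first_byte <= 0xFF and 0 <= second_byte <= 0xFF):
        raise ValueError("bytes must fit in eight bits")
    marker = "XU0" if udc else "XA0"
    # Assumption: bytes are rendered as four uppercase hexadecimal digits.
    return f"{series}{marker}{first_byte:02X}{second_byte:02X}"
```

For example, `afp_gcgid("I", 0x45, 0x41)` gives `IXA04541`. Since the regular Base-36 form never exceeds `SAM`, a second character of `X` cannot collide with it.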

Subcategories of S GCGIDs

A: Arithmetic or mathematical symbols
B: Zhuyin (Bopomofo), including tone marks. Being under S, these are all numbered, and not ID'd by transliteration.
C: Currency symbols
D: Diacritics, both spacing and nonspacing
E: Control Pictures
F: Pseudographics
G: Complete sets of pieces of brackets and integrals, some of which also appear in incomplete sets in other subcategories.
H: Diagonal pseudographics (IBM refers to these as "chemical" graphics, presumably intending them to be used in combination with the SF series for structural or skeletal formulae)
I: Ideographic description characters
L: APL symbols, sometimes duplicating other subcategories.
M: Miscellaneous symbols
O: OCR and MICR symbols
P: Punctuation
S: "Supplementary" symbols, i.e. further miscellaneous symbols.
V: "Various" symbols, i.e. even more miscellaneous symbols.

Subcategories of N GCGIDs

C: Suzhou numeral. Index is equal to numeric value. "C" apparently stands for "Chinese", although "Chinese numeral" would more commonly mean hanzi numerals.
D: Decimal place value numeral (Hindu-Arabic). Index is 10 for zero, numeric value otherwise. Eighth char is 0 for West Arabic, 1 for East Arabic, 2 for Thai, 3 for Persian, 4 for Urdu; the zero also has 7 (circular for CJK typography) and 8 (slashed). In places where the glyphs match, code pages try to use an East Arabic GCGID in a set of Persian or Urdu digits, or use an Urdu GCGID in a set of Persian digits, to avoid proliferating duplicates.
F: Fraction. Index is effectively arbitrary, though a minority have minor mnemonic merit.
O: Enclosed numeral or list marker (ordinal). Index is equal to numeric value. Eighth char is 0 for encircled West Arabic, 1 for parenthesised West Arabic, 2 for a West Arabic list marker with a full stop, or 3 for parenthesised hanzi.
R: Roman. Small indices are allocated in single-case decades, with 01–10 being lowercase 1–10, 11–20 being uppercase 1–10, 21–30 being lowercase 11–20 and 31–40 being uppercase 11–20. Indices 90–93 are for fifty, one hundred, five hundred and one thousand.
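The Roman numeral indexing in the list above (the NR subcategory) reduces to a little arithmetic. The following sketch uses a hypothetical function name, `decode_roman_index`, and assumes that indices 90 to 93 map to fifty, one hundred, five hundred and one thousand in that order; the text does not state a case for those four, so the sketch reports none.

```python
from typing import Optional


def decode_roman_index(index: int) -> tuple[int, Optional[str]]:
    """Return (numeric value, 'lower' / 'upper' / None) for an NR index."""
    large = {90: 50, 91: 100, 92: 500, 93: 1000}  # order assumed from the text
    if index in large:
        return large[index], None  # case is not stated for these
    if 1 <= index <= 40:
        decade, offset = divmod(index - 1, 10)  # decade 0-3, offset 0-9
        value = offset + 1 + (10 if decade >= 2 else 0)
        case = "lower" if decade % 2 == 0 else "upper"
        return value, case
    raise ValueError("index not covered by the scheme described")
```

So index 11 is an uppercase one, index 30 is a lowercase twenty, and index 40 is an uppercase twenty.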

Variant attribute for A GCGIDs (Arabic letters)

0: Isolated (generally), Isolated First Part (seen-like letters, i.e. truncates left descender at the extremum) or Isolated-Initial (tah-like letters, i.e. makes no attempt to prevent connection to the left end of the stroke).
1: Isolated-Final (alephs and yehs; for straight alephs, this is unhooked but slanted to optionally connect to the right, while for the others, the stroke starts low enough to connect without a connection hook) or Isolated (tah-like letters, i.e. gently curves left end of the stroke to prevent connection).
2: Final (connection to right)
3: Initial (connection to left)
4: Medial (connection to both sides)
6: For letters which take a left descender, this means an isolated form with a fully drawn left descender. For alephs, this is the "after lam" form with a left flick. Applied to a kaf (AK01), this denotes a Persian form, i.e. keheh (despite AK02 being used for other forms of keheh). Applied to a lam, this is an initial form, but with a gentle curve into the connection (similar to isolated) rather than the 3 form's right angle (similar to medial).
7: Initial-Medial
9: Generic

Diacritic IDs

These are used as the third and fourth characters in A (Arabic), G (Greek), K (Cyrillic) and L (Latin) GCGIDs. The odd (lowercase) ones besides 65 and 67 are also used as the third and fourth characters in SD (Symbol, Diacritic) IDs; the SD ranges 64–68, 80–89 and 92–97 are a bit awkward in also using even numbers which don't correspond to the letter GCGID use. Numbers below 11 are used in Arabic for letter modifications that do not correspond to a diacritic in transliteration. Especially in non-Latin scripts, a diacritic ID does not necessarily correspond to an actual diacritic, since these IDs are often used to refer to what are different letters in that script, but which are distinguished in Roman transcription with that diacritic.

11/12: Acute
13/14: Grave
15/16: Circumflex
17/18: Trema
19/20: Tilde
21/22: Hachek
23/24: Breve
25/26: Double Acute
27/28: Overring
29/30: Overdot
31/32: Macron
33/34: Right half ring
35/36: Left half ring
41/42: Cedilla
43/44: Ogonek
45/46: Underdot
47/48: Underline
51/52: Primary ligature by first component (the letters ae, oe and ij, plus the ff ligature)
53/54: Second ligature by first component (fi)
55/56: Third ligature by first component (fl)
57/58: Fourth ligature by first component (ffi)
59/60: Fifth ligature by first component (ffl), also used for the disunified capital D-stroke
61/62: Stroke (including on the letters D, H, L, O, T and Cyrillic Gamma, with those on L and O being diagonal). This is also used to mark up the dotless i and the additional letters eszett (when applied to S), eng (when applied to N), and kra (when applied to K). The capital D under 62 is a unification with eth, while the disunified version is under 60 instead.
63/64: Side-dot (e.g. on L but usually as a standalone modifier under SD). This is also used to mark up the additional Latin letters Thorn and Eth (including a disunified capital for the latter), and the Afrikaans apostrophe-n.
65/66: Vertical stroke (in Cyrillic; also, SD65 (not SD66) is a paseq or vertical line)
67/68: Geresh; used in Cyrillic for schwa, Bashkir ka, barred O, straight U.
69/70: (Used in Cyrillic for barred straight U.)
71/72: Macron-Acute
73/74: Trema-Acute
75/76: Trema-Grave
77/78: Trema-Macron
79/80: Trema-Hachek
91/92: Horn
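The odd/even pairing in the table above lends itself to a simple lookup. This is a sketch for letter GCGIDs only: the name `describe_diacritic_id` is hypothetical, the ligature and Cyrillic-specific rows are left out for brevity, and the lowercase/uppercase reading is only meaningful for scripts that have case (it does not apply to the even-numbered SD identifiers mentioned earlier).

```python
from typing import Optional

# Names keyed by the odd (lowercase) member of each pair, copied from the table.
DIACRITICS = {
    11: "Acute", 13: "Grave", 15: "Circumflex", 17: "Trema", 19: "Tilde",
    21: "Hachek", 23: "Breve", 25: "Double Acute", 27: "Overring",
    29: "Overdot", 31: "Macron", 33: "Right half ring", 35: "Left half ring",
    41: "Cedilla", 43: "Ogonek", 45: "Underdot", 47: "Underline",
    61: "Stroke", 63: "Side-dot", 65: "Vertical stroke",
    71: "Macron-Acute", 73: "Trema-Acute", 75: "Trema-Grave",
    77: "Trema-Macron", 79: "Trema-Hachek", 91: "Horn",
}


def describe_diacritic_id(digits: str) -> tuple[Optional[str], str]:
    """Return (diacritic name or None, 'lowercase' or 'uppercase')."""
    if len(digits) != 2 or not digits.isdigit():
        raise ValueError("expected the two digits from positions three and four")
    n = int(digits)
    odd = n if n % 2 else n - 1  # fold the even (capital) ID onto its odd partner
    case = "lowercase" if n % 2 else "uppercase"
    return DIACRITICS.get(odd), case
```

For example, `describe_diacritic_id("12")` reports an uppercase letter with an acute, while `"01"` and `"02"` return no diacritic name because they denote the bare letters.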