Characters (GCGID) | Code page information

Disclaimers

These observations were taken from analysing the GCGIDs used in published code pages and repertoires, plus the limited information available in this former section of the IBM website (link is to Wayback Machine). They are provided here for informational purposes only.

Over-arching GCGID structure

A GCGID (Graphic Character Global Identifier) consists of eight characters, each of which must be either a capital ASCII letter or an ASCII digit. The first character of a GCGID is always a letter, and defines the overarching category of the GCGID; the others vary according to category.

For most GCGIDs, the basic identity of the character is defined by the first four characters alone, with the remaining four setting attributes such as fullwidth, superscript, or variant forms. The exceptions are those beginning with IX, EX, FX, TX or U, which do not follow that structure. Furthermore, most types of GCGID (all except those starting with E, F, I, T, X or U) require the first two characters to be letters and the next two to be digits.
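The constraints described so far can be sketched as a small well-formedness check. This is only an illustration of the rules as stated above, not an IBM-defined validator; the names `GCGID_RE`, `EXEMPT_FIRST_LETTERS` and `is_well_formed` are invented for this example.

```python
import re

# Eight characters, each a capital ASCII letter or ASCII digit,
# with a letter in the first position.
GCGID_RE = re.compile(r"[A-Z][A-Z0-9]{7}")

# Categories that, per the text above, are exempt from the
# "two letters, then two digits" rule.
EXEMPT_FIRST_LETTERS = set("EFITXU")


def is_well_formed(gcgid: str) -> bool:
    """Check only the surface syntax; this says nothing about whether
    the GCGID is actually assigned to a character."""
    if GCGID_RE.fullmatch(gcgid) is None:
        return False
    if gcgid[0] in EXEMPT_FIRST_LETTERS:
        return True
    # All other categories: second character a letter, then two digits.
    return gcgid[1].isalpha() and gcgid[2:4].isdigit()


print(is_well_formed("SC050000"))  # Yen sign, mentioned later in this page
print(is_well_formed("sc050000"))  # lowercase letters are not allowed
```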

Orthodox structure

For the categories that fully follow the GCGID structure rather than deviating to a greater or lesser extent, the positions break down as follows. The second half is applicable to more categories than the first.

  1. ASCII letter, general category of character as noted above and elaborated below.
  2. ASCII letter, more specific. Under N or S, this is a subcategory. Elsewhere, this is usually a Basic Latin base letter which could (with or without certain modifiers) correspond to the character; for kana, this is the consonant if applicable or the vowel otherwise. If the character itself is inherently a modifier and/or transliterates to a modifier, this is X; kana also uses Q for punctuation.
  3. An ASCII digit.
  4. A second ASCII digit. These two digits have the following meanings:
    • For Latin letters, capitals are always even, bare lowercase is 01, bare uppercase is 02 and versions modified with diacritics are 11 or higher. Furthermore, Hebrew, Greek, Cyrillic and to a lesser extent Arabic letters are generally coördinated with whatever diacritics or modifiers appear in the Latin transcription, though Arabic may use 02–10 for letter modifications not corresponding to a diacritic.
    • For kana, the first digit is 1 through 5 denoting the vowel, 0 if the character is itself a vowel, or 7 if the character is a modifier or punctuation. For syllabograms, the second digit is 1 if small and 0 otherwise; for a modifier, 0 is the vowel extender, 1 the dakuten and 2 the handakuten.
    • For characters under N or S, this is an index within the subcategory. Under N, this does not necessarily correspond to the numerical value, though it sometimes does; this is detailed further below.
  5. 0 for full-height spacing, 1 for superscript, 2 for subscript and 8 for combining. Big5-style small horizontal forms are treated the same as the corresponding superscripts for some reason, so the appropriate mapping for GCGIDs with 1 here and 8 in the width field can vary.
  6. ASCII digit, size/position variant, usually for a diacritic but sometimes used in semigraphics et cetera.
  7. Width attribute: Z for zero-width, H for explicitly halfwidth, 8 for fullwidth. In katakana and jamo, 0 denotes halfwidth in practice, otherwise 0 is used for neutral, natural, proportional or unspecified width. In AFP/FOCA code page resources with names starting with T1H, it is fairly typical for GCGIDs which have 0 here in the regular code page definition to instead have H here.
  8. A character, usually an ASCII digit (though it can start using letters if there are more than ten needed) identifying a character variant, or a character which might in specific contexts be interchangeable. These may be coördinated between coördinate characters (e.g. Arabic letters) to an extent, but there is no overarching scheme for all or even most characters.
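The positional breakdown above can be sketched as a simple splitter. This is a minimal illustration that assumes the GCGID belongs to a category following the orthodox structure; the field names and the `PLACEMENT` and `WIDTH` tables are labels invented for this example, not IBM terminology.

```python
from typing import NamedTuple


class OrthodoxGCGID(NamedTuple):
    category: str      # position 1
    base: str          # position 2
    index: str         # positions 3 and 4
    placement: str     # position 5
    size_variant: str  # position 6
    width: str         # position 7
    variant: str       # position 8


# Values listed in the text for position 5.
PLACEMENT = {
    "0": "full-height spacing",
    "1": "superscript",
    "2": "subscript",
    "8": "combining",
}

# Values listed in the text for position 7. Note that for katakana
# and jamo, 0 denotes halfwidth in practice.
WIDTH = {
    "Z": "zero-width",
    "H": "halfwidth",
    "8": "fullwidth",
    "0": "neutral or unspecified",
}


def split_orthodox(gcgid: str) -> OrthodoxGCGID:
    if len(gcgid) != 8:
        raise ValueError("a GCGID has exactly eight characters")
    return OrthodoxGCGID(
        gcgid[0], gcgid[1], gcgid[2:4], gcgid[4], gcgid[5], gcgid[6], gcgid[7]
    )


def describe(gcgid: str) -> tuple[str, str]:
    """Return (placement, width) descriptions, or 'unknown' for values
    not listed in the text."""
    fields = split_orthodox(gcgid)
    return (
        PLACEMENT.get(fields.placement, "unknown"),
        WIDTH.get(fields.width, "unknown"),
    )


print(split_orthodox("SC050000"))
print(describe("SC050000"))
```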

Relationship to Unicode

It is not always true that one GCGID can be considered equivalent to one Unicode code point: CJK characters have GCGIDs assigned tautologically from their language-specific EBCDIC encodings, some distinct GCGIDs are unified by Unicode (Yen sign (SC050000) and Yuan sign (SC120000) for instance), and some GCGIDs encompass multiple others (there are three separate GCGIDs for capital ð (LD640000), capital đ (LD600000), and the unification of the two (LD620000), for example).

There is no one authoritative mapping from GCGIDs to Unicode, only from a given codepage to both GCGIDs and Unicode strictly within the context of that codepage; C-H 3-3220-024 2002-11, C-H 3-3220-126 2016-04, this former section of the IBM website (link is to Wayback Machine), the IBM AFP ECPs, and several ICU codecs all include some IBM-originated Unicode mappings for a subset of GCGIDs, but do not always agree on the Unicode mapping for a given GCGID, and none of them map anything close to all GCGIDs.
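The lack of a one-to-one correspondence can be illustrated with a toy table built from the examples above. The Unicode code points here are this sketch's own reading of those examples (U+00A5 for the Yen/Yuan sign, U+00D0 and U+0110 for the two capital letters); they are not an authoritative mapping, and `GCGID_TO_UNICODE` and `round_trips` are names invented for illustration.

```python
# Toy table: each GCGID maps to the set of code points it may stand for.
GCGID_TO_UNICODE = {
    "SC050000": {"\u00A5"},            # Yen sign
    "SC120000": {"\u00A5"},            # Yuan sign, unified with Yen in Unicode
    "LD640000": {"\u00D0"},            # capital eth
    "LD600000": {"\u0110"},            # capital d with stroke
    "LD620000": {"\u00D0", "\u0110"},  # unification of the two
}


def round_trips(gcgid: str) -> bool:
    """True only if the GCGID maps to exactly one code point and no
    other GCGID in this toy table maps to that same single code point."""
    targets = GCGID_TO_UNICODE[gcgid]
    if len(targets) != 1:
        return False
    sources = [g for g, t in GCGID_TO_UNICODE.items() if t == targets]
    return sources == [gcgid]


for gcgid in GCGID_TO_UNICODE:
    print(gcgid, round_trips(gcgid))
```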

Top-level categories of GCGID

Links are provided to lists of names and Unicode mappings for each category. These mappings are not guaranteed to interoperate with anything, and are not stabilised (I may amend them without warning), but are purely illustrative. Since GCGIDs do not map one-to-one with Unicode, round-tripping is not guaranteed either.

GCGID overview table

Structure of E/F/I/T/X GCGIDs

In the specification documents, E/F/I/T/X GCGIDs are code points from a double-byte host code (i.e. EBCDIC Shift Out) transformed/compressed into the second, third and fourth characters. E refers to the Simplified Chinese code (CPGID 837), F to the Korean one (CPGID 834), I to the Japanese one (CPGID 300) and T to the Traditional Chinese one (CPGID 835). The transformation from coding bytes to GCGID in this case is base36(((firstByte - 0x40) * 0xC0) + (secondByte - 0x40)), where Base-36 uses 0–9 then A–Z (this means 0 is contrastive with O). This variant of the E, I and T series is exclusively for hanzi; however, this variant of the F series also includes Korean syllables (more numerous than hanzi), encircled forms, box drawing and other characters, even in some cases those which also have orthodox GCGIDs.
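The transformation above can be sketched in both directions as follows. The function names are invented for this example. Padding the result to three characters with leading zeros, and restricting both bytes to 0x40 through 0xFE, are assumptions made here rather than something the text states.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 0-9, then A-Z


def dbcs_to_base36(first_byte: int, second_byte: int) -> str:
    """Compress a double-byte host code point to three Base-36 characters."""
    # Assumed valid byte range for this sketch.
    if not (0x40 <= first_byte <= 0xFE and 0x40 <= second_byte <= 0xFE):
        raise ValueError("bytes are expected to lie in 0x40..0xFE")
    value = (first_byte - 0x40) * 0xC0 + (second_byte - 0x40)
    out = ""
    for _ in range(3):  # assumed: always three characters, zero-padded
        value, remainder = divmod(value, 36)
        out = DIGITS[remainder] + out
    return out


def base36_to_dbcs(chars: str) -> tuple[int, int]:
    """Inverse of dbcs_to_base36."""
    high, low = divmod(int(chars, 36), 0xC0)
    return high + 0x40, low + 0x40


print(dbcs_to_base36(0xFE, 0xFE))  # SAM
print(base36_to_dbcs("SAM") == (0xFE, 0xFE))  # True
```

Under these assumptions, the second to fourth characters of a Japanese-series GCGID for host code 0xFE 0xFE would be `SAM`, giving a GCGID beginning `ISAM`.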

Another variant has ?XA0, where ? is replaced with the series letter, followed by the hexadecimal of the EBCDIC representation (this works since the last valid Base-36 representation is ?SAM for 0xFE 0xFE); rather than using a separate X series, user-defined characters (UDC) in this variant use ?XU0 under the usual series. This variant of GCGIDs is not limited to hanzi, and appears in ECP code page files as an alternative both to the regular E/F/I/T/X GCGIDs and to GCUIDs; it is used, in fact, for all characters that are represented in double-byte host code with a lead byte of 0x45 or greater, regardless of whether they have other GCGIDs assigned in the respective specifications. IBM seem to refer to this type as "AFP GCGIDs", e.g. in the CPGID 1445 changelog.
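A minimal sketch of the ?XA0 form follows. The function name `afp_gcgid` is invented here. Uppercase hexadecimal is assumed, on the grounds that GCGIDs consist only of capital letters and digits, and the ?XU0 form for UDC is left out of this sketch.

```python
SERIES_LETTERS = {"E", "F", "I", "T"}


def afp_gcgid(series: str, first_byte: int, second_byte: int) -> str:
    """Build the ?XA0 form: series letter, 'XA0', then four hex digits."""
    if series not in SERIES_LETTERS:
        raise ValueError("series must be one of E, F, I or T")
    if not (0 <= first_byte <= 0xFF and 0 <= second_byte <= 0xFF):
        raise ValueError("each byte must fit in eight bits")
    # Assumed: uppercase hex, two digits per byte.
    return f"{series}XA0{first_byte:02X}{second_byte:02X}"


print(afp_gcgid("I", 0x45, 0x41))  # IXA04541
```

The result is always eight characters long, and its second to fourth characters (`XA0`) sort after `SAM`, which is why the two variants cannot collide.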

Subcategories of S GCGIDs

Subcategories of N GCGIDs

Variant attribute for A GCGIDs (Arabic letters)

Diacritic IDs

These are used as the third and fourth characters in A (Arabic), G (Greek), K (Cyrillic) and L (Latin) GCGIDs. The odd (lowercase) ones besides 65 and 67 are also used as the third and fourth characters in SD (Symbol, Diacritic) IDs; the SD ranges 64–68, 80–89 and 92–97 are a bit awkward in also using even numbers which don't correspond to the letter GCGID use. Numbers below 11 are used in Arabic for letter modifications that do not correspond to a diacritic in transliteration. Especially in non-Latin scripts, a diacritic ID does not necessarily correspond to an actual diacritic, since such IDs are often used to refer to what are different letters in that script, but which are distinguished in Roman transcription with that diacritic.
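For Latin letters, the parity and threshold rules described on this page (even for capitals, odd for lowercase, 11 or higher when a diacritic is involved) can be sketched as below. The function name is invented for this example, and it deliberately handles only the L category, since other scripts use the numbers differently.

```python
def latin_case_and_modifier(gcgid: str) -> tuple[str, bool]:
    """Return (case, has_diacritic_id) for a Latin-letter GCGID."""
    if len(gcgid) != 8 or gcgid[0] != "L" or not gcgid[2:4].isdigit():
        raise ValueError("expected an eight-character L-category GCGID")
    number = int(gcgid[2:4])
    case = "capital" if number % 2 == 0 else "small"
    # 01 and 02 are the bare letters; diacritic IDs start at 11.
    return case, number >= 11


print(latin_case_and_modifier("LA010000"))  # bare lowercase a
print(latin_case_and_modifier("LD640000"))  # listed earlier as a capital
```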