Code page information

Upfront disclaimer

All IBM-sourced files are mirrored from the listed public locations, both for ease of access (especially significant for the subset only found over Wayback Machine mirrors) and for the sake of preserving what I feel is important information in the public interest. I do not work for IBM, and have never worked for IBM as of the time of writing.

For the sake of accurately documenting the code pages, repertoires and encodings, the definitions given and presented for them are intended to be, where possible, equivalent to those implemented and/or given elsewhere (except where errors or other issues are identified in the source, or information is absent); sources are linked where possible. Furthermore, where possible, practical and unambiguous, identifying names given for code pages, encodings, repertoires and character GCGIDs reflect those used elsewhere, for example in linked sources, although they may be corrected, abridged or disambiguated where this is deemed appropriate.

Except where noted otherwise above or elsewhere, everything else is my own work. For my own contributions (including, by way of example, derivation, extraction and/or reconstruction of machine-readable indices, consolidated or customised GCGID maps, explanations and commentary, computed information such as superset/subset relationships, other custom presentation of information, and interactive software including file export options), I guarantee neither accuracy nor stability: I may amend them at any time.

GCGIDs (Characters; Graphic Character Global Identifiers)

A GCGID is an eight-character code assigned to a character to enable mapping of characters between code pages, and from code pages to certain types of font. GCGIDs do not always correspond one-to-one to Unicode code points.

⤷ Information about GCGIDs

GCSGIDs (Repertoires; Graphic Character Set Global Identifiers)

These are "graphic character sets" (GCS or CS), by which IBM means what would nowadays be called a "noncoded character set" (NCS) or "character repertoire". Each GCSGID corresponds to a collection of GCGIDs. Two encodings with the same GCSGID (or, in the case of a variable-width encoding, combination of GCSGIDs) can be symmetrically transcoded. The special value 65520 (0xFFF0) denotes the empty set; the special value 65535 (0xFFFF) denotes a "growing" (open) set of characters which can be enlarged at any time.
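As a minimal sketch of the two special values mentioned above, the following Python helper classifies a GCSGID number. The function name and the returned labels are hypothetical, chosen only for illustration; any value other than the two special ones is simply reported as "ordinary", without checking whether it is actually registered.

```python
# Special GCSGID values as described in the text above.
EMPTY_GCSGID = 0xFFF0    # 65520: the empty set
GROWING_GCSGID = 0xFFFF  # 65535: a "growing" (open) set


def gcsgid_kind(gcsgid: int) -> str:
    """Return a label for a GCSGID (labels are illustrative, not IBM terms)."""
    if not 0 <= gcsgid <= 0xFFFF:
        raise ValueError("a GCSGID is assumed here to fit in sixteen bits")
    if gcsgid == EMPTY_GCSGID:
        return "empty"
    if gcsgid == GROWING_GCSGID:
        return "growing"
    return "ordinary"
```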

GCSGIDs occupy an entirely separate namespace from CPGIDs and CCSIDs.

⤷ List of GCSGIDs

FGIDs (Fonts; Font Global Identifiers)

Each font is assigned an FGID. A resource containing glyphs for a font will generally be associated with a particular FGID and a particular GCSGID, i.e. it implements a particular font (specified by a FGID) for a particular repertoire of GCGIDs (specified by a GCSGID). FGIDs are not tabulated here since they are mostly outside of the scope of this website, but they are mentioned here for completeness, since they help to illustrate the function of a GCSGID as part of a larger system including both FGIDs and CPGIDs.

CPGIDs (Planes; Code Page Global Identifiers)

A CPGID (code page global identifier) identifies a "code page", by which IBM means what would nowadays be called a character "plane". Each CPGID corresponds to a mapping of GCGIDs (see explanation of GCSGIDs and GCGIDs above) to fixed-width byte sequences. A variable-width encoding is instead defined at the CCSID level. As with GCSGIDs, 65520 (0xFFF0) is a special value, here denoting an empty plane.

Although the definition of "code page" used in IBM's Character Data Representation Architecture (CDRA), in which it is interchangeable with CPGID, differs from the Microsoft-influenced vernacular meaning (which instead corresponds to a CCSID), CPGIDs and CCSIDs (unlike GCSGIDs) occupy the same namespace in practice. Also in practice, even in IBM contexts, the term "code page" is not limited to CPGIDs. For example, labels for IBM code pages in International Components for Unicode (ICU) in the cpNNN format often have NNN be a CCSID rather than a CPGID; AFP/FOCA code pages often specify a GCSGID (and multiple can exist for a given CPGID); and this feature ticket refers to a "mixed" (i.e. variable-width multi-component) CCSID as a "code page".

⤷ List of CPGIDs

CGCSGIDs, GCIDs or CHRIDs (Encodings; Coded Graphic Character Set Global Identifiers)

A tuple of a GCSGID and a CPGID is variously referred to in documentation as a CGCSGID, a GCID or a CHRID (no, I don't know what the latter two stand for, though I'd hazard a guess at "GCSGID and CPGID identifier" for "GCID"). In big-endian binary form, it contains a GCSGID in its first two bytes and a CPGID in its last two bytes. In decimal, the GCSGID is likewise written before the CPGID; they are five decimal digits each, although they are typically written separated in some way (e.g. by a hyphen or tab), and leading zeroes may or may not be written. This primarily allows a simple, single-byte or fixed-width encoding to be fully specified with both a plane and a fixed repertoire (for subsetting, or for versions before/after expansions), so that it can be mapped onto another code page or onto a font supporting that repertoire.

If the GCSGID is set to zero within a CGCSGID, this means that the CPGID listed is actually a CCSID (see below). This is most obviously necessary if the encoding is variable-width and uses multiple components, and therefore has more than one underlying CPGID.
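The binary and decimal layouts described above can be sketched in Python as follows. The function names are hypothetical, the hyphen separator is just one of the conventions mentioned, and the pair 697/37 is used purely as sample numbers; this is an illustration of the layout, not a reference implementation.

```python
import struct


def pack_cgcsgid(gcsgid: int, cpgid: int) -> bytes:
    """Pack a (GCSGID, CPGID) pair into four big-endian bytes."""
    # ">HH" = two unsigned 16-bit integers, big-endian: GCSGID first, CPGID last.
    return struct.pack(">HH", gcsgid, cpgid)


def unpack_cgcsgid(raw: bytes) -> tuple[int, int]:
    """Split four big-endian bytes back into (GCSGID, CPGID)."""
    gcsgid, cpgid = struct.unpack(">HH", raw)
    return gcsgid, cpgid


def format_cgcsgid(gcsgid: int, cpgid: int, sep: str = "-") -> str:
    """Write the pair in decimal, five digits each, GCSGID first."""
    return f"{gcsgid:05d}{sep}{cpgid:05d}"


def second_field_is_ccsid(gcsgid: int) -> bool:
    """A zero GCSGID signals that the second field holds a CCSID, not a CPGID."""
    return gcsgid == 0
```

With the sample pair, pack_cgcsgid(697, 37) yields the bytes 02 B9 00 25, and format_cgcsgid(697, 37) yields "00697-00037".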

ESIDs (Encoding Schemes; Encoding Scheme Identifiers)

Pure CGCSGIDs have the weakness that not all encodings can be adequately expressed as such. In particular, no variable-width encoding can be, since such encodings comprise multiple CPGIDs combined using a code extension scheme. Even some fixed-width encodings, however, may require encoding scheme information beyond that specified by a CPGID's default (e.g. the JIS X 0208 plane can be used over either 0x21–7E or 0xA1–FE; the latter is what EUC-JP is a superset of).
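To illustrate the JIS X 0208 case just mentioned, the same two-byte code can be moved between the two byte ranges by setting or clearing the high bit of each byte. The following is a minimal sketch with hypothetical function names; it only shifts byte ranges and does not check that the code is actually allocated in the plane.

```python
def gl_to_gr(code: bytes) -> bytes:
    """Move each byte from 0x21-0x7E to 0xA1-0xFE by setting the high bit."""
    if not all(0x21 <= b <= 0x7E for b in code):
        raise ValueError("expected every byte to lie in 0x21-0x7E")
    return bytes(b | 0x80 for b in code)


def gr_to_gl(code: bytes) -> bytes:
    """Move each byte from 0xA1-0xFE to 0x21-0x7E by clearing the high bit."""
    if not all(0xA1 <= b <= 0xFE for b in code):
        raise ValueError("expected every byte to lie in 0xA1-0xFE")
    return bytes(b & 0x7F for b in code)
```

For instance, gl_to_gr(b"\x24\x22") gives b"\xa4\xa2", which is the EUC-JP representation of the hiragana あ.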

To address this, an ESID can be given, which identifies an encoding scheme. Some ESIDs may be specified to elaborate a single CGCSGID, while others must be specified along with multiple CGCSGIDs; some ESIDs also require additional information about how the individual components are represented (sometimes known as an ACRI, for "additional coding-related required information"). Examples of encoding schemes include using a lead-byte range mask (the so-called "PCMB", "Personal Computer Multi-Byte", scheme; the lead-byte mask itself is determined from ACRI), or using a higher encoding system such as EUC.
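As a rough sketch of how a lead-byte range scheme separates single-byte from double-byte codes, the Python below splits a byte string into individual codes. The lead-byte ranges shown are hypothetical placeholders (a real scheme would take them from the ACRI), the function names are invented for illustration, and the sketch assumes every lead byte introduces exactly a two-byte code.

```python
# Hypothetical inclusive lead-byte ranges; in practice these would come from ACRI.
LEAD_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]


def is_lead_byte(b: int, ranges=LEAD_BYTE_RANGES) -> bool:
    """True if the byte falls in any of the lead-byte ranges."""
    return any(lo <= b <= hi for lo, hi in ranges)


def split_codes(data: bytes, ranges=LEAD_BYTE_RANGES) -> list[bytes]:
    """Split a byte string into one-byte and two-byte codes."""
    codes = []
    i = 0
    while i < len(data):
        # A lead byte starts a two-byte code; anything else stands alone.
        width = 2 if is_lead_byte(data[i], ranges) else 1
        if i + width > len(data):
            raise ValueError("truncated double-byte code at end of input")
        codes.append(data[i:i + width])
        i += width
    return codes
```

Each resulting one-byte code would then be looked up in the single-byte component, and each two-byte code in the double-byte component.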

CCSIDs (Encodings; Coded Character Set Identifiers)

For the most part, one can regard every CPGID as also being a CCSID. However, single-component CCSIDs apply a constraint to a particular GCSGID, so (unless that is the "growing" repertoire) a later expansion to a code page (i.e. one postdating the formalisation of the CCSID system) will generally keep the same CPGID but be assigned a new CCSID. For example, CCSID 1252 is the version of CPGID 1252 (Windows-1252) before €, Ž and ž were added; the version after that was done is CCSID 5348, but still CPGID 1252. Furthermore, a CCSID specifies an ESID, while a CPGID may have additional applicable ESIDs besides the default.

Additionally, unlike CPGIDs, CCSIDs can define variable-width encodings, as a multiple-component CCSID defined as the combination of multiple single-component CCSIDs, with an overarching ESID and ACRI.

Usually, the high four bits of the (sixteen-bit) CCSID are used to distinguish variants with different GCSGIDs and/or ESIDs (including later expansions), equivalent to adding a multiple of 4096. This is only true for CCSIDs 57343 (0xDFFF) and lower though; higher-numbered CCSIDs have no relation to their remainder by 4096 and are used for special purposes, such as invariant subsets for particular purposes as well as the aforementioned empty plane 65520.
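The arithmetic above can be sketched as follows. The function name and the choice to return None for the special-purpose range are assumptions made for illustration, and since the text says "usually", the result should be read as a hint rather than a guarantee that two CCSIDs are related.

```python
def split_ccsid(ccsid: int):
    """Split a CCSID into (variant, base), or return None above 0xDFFF.

    variant is the high four bits; base is the remainder modulo 4096.
    """
    if not 0 <= ccsid <= 0xFFFF:
        raise ValueError("a CCSID is a sixteen-bit number")
    if ccsid > 0xDFFF:
        # Special-purpose range: no relation to the remainder by 4096.
        return None
    return ccsid >> 12, ccsid & 0x0FFF
```

For example, split_ccsid(5348) gives (1, 1252) and split_ccsid(8229) gives (2, 37), matching 5348 = 1252 + 4096 and 8229 = 37 + 2 × 4096, while split_ccsid(65520) gives None.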

Expansions prior to the formalisation of the CCSID system, by contrast, seem to get assigned the eponymous CCSID. For example, CCSID 37 stands out very prominently in terms of the regions allocated (being a fully-allocated CECP) amongst CCSIDs 1–40, while CCSID 8229 corresponds to CPGID 37's "base set" (as opposed to CECP set) CGCSGID, thus representing the original version of the code page.

A tabulation of CCSIDs is provided. Some details have been expanded or extrapolated compared to IBM's documentation.

⤷ List of CCSIDs

Specifications

Although registrations of GCSGIDs and CPGIDs fall under the umbrella of C-H 3-3220-050, and registrations of GCGIDs fall under the umbrella of C-H 3-3220-055, several other specifications exist, mainly for East Asian multi-byte encodings and their correspondences between EBCDIC ("Host"), PC and EUC representations.

⤷ List of specifications

AFP resource names

Single-component encodings may be identified by the names of their FOCA code pages used with AFP (recognisable by starting with "T1"). These are listed in the CCSIDs list where possible.

⤷ Structure of AFP code page resource names