The strange case of Code Page 354
Background
Code pages 353 through 360 are "PTTC/BCD" or "BCDIC" code pages (in ESID terms, this is D100, denoting "PTTC/BCDIC"). These are variants of an earlier 6-bit encoding (Binary-Coded Decimal Interchange Code) which is the direct precursor to EBCDIC. The name refers to the fact that digits 1 through 9 are assigned to code points 1 through 9 (with 0 being assigned to either 0 or 10), making it a direct evolution of the concept of binary-coded decimal, i.e. representing a numeral as a sequence of code units whose binary values equal the values of its decimal digits.
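For a concrete sense of what that means, here is a minimal sketch of the digit-to-code-point rule (my own illustration, not taken from any IBM chart; it arbitrarily maps 0 to code point 10, as in the variants that do not keep it at 0):

```python
def bcd_code_units(numeral: str, zero: int = 10) -> list[int]:
    # Each decimal digit becomes one code unit equal to its value.
    return [zero if ch == "0" else int(ch) for ch in numeral]

print(bcd_code_units("1905"))  # [1, 9, 10, 5]
```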
A former section of the IBM website, still accessible over the Wayback Machine, includes published charts for code pages 353 through 360, excluding 354 and 356. These charts have broken formatting: presumably each is supposed to be a table, but instead each shows as a linear list of bordered cells, one per code point, grouped by low nybble(!), followed by several PDF pages of warning messages (code page 353, for example). One wonders whether IBM actually tested the template for generating 6-bit charts, as opposed to 7-bit, 8-bit extended ASCII or 8-bit EBCDIC charts. Nonetheless, they are usable charts from which the information can be extracted.
Another quirk of that former section of the IBM website is that "Spain (00014)" is linked to the same file as "South Africa (00031)" on the row above. This means that a chart for code page 14 is not actually linked from that page; however, a chart for code page 14 is available elsewhere on the IBM website. In the same "accessible and traversable by design" directory on the site, there are also what appear to be PDFs for code pages 354, 1118, 1132, 1155 and 1449. However, they cannot be opened as PDFs: they appear to be corrupt.
Initial analysis
The "PDF" file for code page 354 starts with a fragment of ASCII HTML text, which includes a URL in a comment. Sadly, that URL is neither active nor available on the Wayback Machine.
So, the PDF has HTML appended at the start and the end. Additionally, the (non-textual) PDF data appears to contain HTML character entity sequences, such as escaped ampersands and named entities (the latter presumably based on misinterpretation of the binary data as ISO-8859-1). However, with these obvious changes fixed, it still did not seem to be valid PDF. Hmm…
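Reversing those obvious changes is straightforward in principle. A rough sketch of the sort of thing involved (the slice bounds are placeholders, and it assumes the bytes were read as ISO-8859-1 before being entity-escaped, as suggested above):

```python
import html

def strip_and_unescape(corrupt: bytes, start: int, end: int) -> bytes:
    """Drop the HTML prepended/appended to the "PDF", then reverse the HTML
    entity escaping, assuming the bytes had been read as ISO-8859-1
    (a lossless byte-to-character mapping, so re-encoding is safe)."""
    core = corrupt[start:end]            # placeholder bounds for the PDF proper
    text = core.decode("iso-8859-1")
    return html.unescape(text).encode("iso-8859-1")
```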
We aren't entirely on our own though: as I mentioned, a PDF file for code page 1118 also exists in that directory; it is corrupt in the same way but, importantly, a non-corrupted copy of the code page 1118 PDF is available over the Wayback Machine, so the two can be directly compared. Usefully, both the 1118 and 354 PDFs use the same compression method (/LZWDecode). Converting both the valid code page 1118 PDF and the corrupted one (with appendages removed and HTML entity escapes reversed) to sequences of hexadecimal representations of bytes, and comparing them using Python's difflib (a minimal sketch of this appears after the list below), revealed the following differences, presumably attempts to normalise the (incorrectly) presumed HTML data:
- All null bytes had been removed from the corrupted one, completely, not replaced by anything.
- All ASCII whitespace bytes (09, 0A, 0B, 0C, 0D and 20) and sequences thereof (e.g. 0C20) had been replaced with either a single space or space+CR+LF, depending solely on how long the "line" already was.
- The occasional region seemed to differ far more substantially. I did not yet know what to make of this, not least because I did not know for sure that the two code page 1118 PDFs had been absolutely identical to begin with.
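The comparison itself was nothing elaborate; a minimal sketch (function and file names are mine) of hex-dumping both files and diffing them:

```python
import difflib

def hex_lines(data: bytes) -> list[str]:
    # One hex pair per line, so difflib aligns insertions and removals
    # at byte granularity.
    return [f"{b:02X}" for b in data]

def show_hex_diff(valid: bytes, corrupt: bytes) -> None:
    diff = difflib.unified_diff(hex_lines(valid), hex_lines(corrupt),
                                fromfile="1118-valid", tofile="1118-corrupt",
                                lineterm="")
    for line in diff:
        print(line)
```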
So what to do about this?
Chained validators
I mentioned earlier that both the corrupt 354 and 1118 PDFs, in addition to the intact 1118 PDF (and e.g. the 353 PDF), used /LZWDecode; these sections constituted the page data of the PDFs. Clearly, if I was to be able to do anything in terms of reconstructing the LZW blobs, I would need a PDF LZW decoder; fortunately, I found an MIT/Expat-licensed one from PDFsharp. That is in C#, while I usually prefer Python for my personal projects; it is also not particularly idiomatic C# (the NextCode property's getter alters the state that the property value is derived from, so the property has a different value every time it is accessed; this should be done by a method, not a property). The unidiomatic C# could have clued me in to it being a not-entirely-correct implementation of PDF LZW (more on that later), but it did not at the time.
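To illustrate that interface point with a hypothetical Python analogue (this is not the PDFsharp code, just a sketch of the pattern): a value that advances internal state every time it is read belongs behind a method, not a property.

```python
class BitCodeReader:
    """Hypothetical reader of fixed-width codes from a byte string."""

    def __init__(self, data: bytes, width: int = 9):
        self._data = data
        self._width = width
        self._bitpos = 0

    # Unidiomatic: a property whose getter mutates state, so merely
    # evaluating reader.next_code twice yields two different values.
    @property
    def next_code(self) -> int:
        return self._pull()

    # Idiomatic: the side effect is explicit in the call syntax.
    def read_code(self) -> int:
        return self._pull()

    def _pull(self) -> int:
        value = 0
        for _ in range(self._width):
            byte_index, bit_index = divmod(self._bitpos, 8)
            value = (value << 1) | ((self._data[byte_index] >> (7 - bit_index)) & 1)
            self._bitpos += 1
        return value
```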
More saliently, it was written to pull its data out of a byte array, and in a fairly lenient manner in terms of what data it accepted. This stands in stark contrast to what we need here:
- It must accept its input data being dribbled in, byte by byte.
- It must flag up an error as soon as it encounters data that appears to be wrong.
- It must be clonable, so multiple branching possibilities can be kept around.
- It must be chainable, so the output from the LZW decoder (thus verified to be valid LZW) can be verified to be valid and expected PDF instructions.
First, I translated it from C# to Python and decompressed the valid LZW using it. I used this to identify some rules as to what the output should look like (these could be narrower than merely validating arbitrary PDF page data, since, having the other BCDIC chart PDFs, I could see what sequence of instructions, in what style and so on, to expect; that I was basing these patterns on the output of the same LZW decoder would prove to be important later). I then refactored the LZW decoder heavily to fit the above criteria in its interface and chained it with the page data validator.
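The resulting interface looked roughly like the following simplified sketch (class and method names are my own for illustration, and the toy check here stands in for page-data rules that were considerably more specific):

```python
import copy

class ValidationError(Exception):
    pass

class ChainedValidator:
    """Shared shape: bytes are dribbled in one at a time via feed(), errors
    surface immediately as ValidationError, and clone() snapshots the whole
    chain so several branching possibilities can be kept alive at once."""

    def __init__(self, sink=None):
        self.sink = sink                      # next validator in the chain

    def feed(self, byte: int) -> None:
        raise NotImplementedError

    def clone(self):
        return copy.deepcopy(self)

    def emit(self, data: bytes) -> None:
        if self.sink is not None:
            for b in data:
                self.sink.feed(b)

class PageDataValidator(ChainedValidator):
    """Toy stand-in for the page-data checks: these content streams were
    plain text operators, so anything outside printable ASCII plus
    whitespace is rejected on the spot."""

    ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

    def feed(self, byte: int) -> None:
        if byte not in self.ALLOWED:
            raise ValidationError(f"unexpected byte {byte:#04x} in page data")
        self.emit(bytes([byte]))
```

In this sketch the refactored LZW decoder would take the same shape, pushing each decompressed byte down the chain through emit() into the page data validator.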
Tree management
Exponential blowup was a problem from fairly early on. Since a null byte could potentially be inserted at any position, or a whitespace byte expanded to an arbitrarily long sequence, the number of possibilities skyrocketed when working over the corrupted data (and, contrary to my hopes, the validation was not enough to damp this). I eventually settled on the following mitigations (sketched in code after the list):
- Determine the probability of each possibility so they can be prioritised. This would generally be higher for possibilities that had progressed further without failing to find a valid continuation, and generally lower for possibilities with more insertions of null bytes or additional whitespace bytes, since positions where bytes do not need to be inserted would naturally be far more numerous than those where they do. I also used a depth-1 Markov chain taken from the valid PDFs to refine this further, though I do not know whether it actually improved the outcome.
- Keep only 512 possibilities that are being actively pursued at any given time. Less likely possibilities would be moved to a separate list of up to 65536 possibilities which would only be reconsidered when the main list of possibilities dropped below length 512. The least likely possibilities would just be discarded in the process.
- Limit the number of sequentially inserted bytes to three, on the basis that needing more than that would be very unlikely.
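A rough sketch of that bookkeeping (the limits are those described above; the Possibility fields and the penalty values are illustrative placeholders of my own):

```python
from dataclasses import dataclass

ACTIVE_LIMIT = 512
DEFERRED_LIMIT = 65536
MAX_INSERTION_RUN = 3
INSERTION_PENALTY = -4.0   # placeholder log-probability cost per inserted byte
PROGRESS_BONUS = 0.1       # placeholder reward per corrupted byte consumed

@dataclass
class Possibility:
    bytes_consumed: int = 0    # how far through the corrupted data we are
    insertions: int = 0        # total null/whitespace bytes inserted so far
    insertion_run: int = 0     # consecutive insertions at the current point
    # ...plus the cloned validator chain and the reconstructed output so far

    @property
    def score(self) -> float:
        return self.bytes_consumed * PROGRESS_BONUS + self.insertions * INSERTION_PENALTY

def may_insert(p: Possibility) -> bool:
    # Cap runs of inserted bytes at three; longer runs were deemed too unlikely.
    return p.insertion_run < MAX_INSERTION_RUN

def prune(possibilities: list[Possibility]) -> tuple[list[Possibility], list[Possibility]]:
    """Keep the 512 most probable possibilities active, park the next 65536
    for later reconsideration, and silently drop the rest."""
    ranked = sorted(possibilities, key=lambda p: p.score, reverse=True)
    return ranked[:ACTIVE_LIMIT], ranked[ACTIVE_LIMIT:ACTIVE_LIMIT + DEFERRED_LIMIT]
```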
Partial reconstructions of the page data from here could be substituted into an outer PDF template and viewed in Inkscape. Opening them in a PDF viewer did not always work; I did not yet realise why this was. This was sufficient to identify defects in the reconstructed page data and correct them. That in turn added an additional validation step, the prefix checker, which filters for possibilities whose output begins with, or is still an initial portion of, a particular (corrected, or known to be correct) prefix.
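The prefix check itself is the simple part; something along these lines (a sketch, with my own naming):

```python
def consistent_with_known_prefix(output: bytes, known_prefix: bytes) -> bool:
    """A possibility survives this filter if its decompressed output so far
    agrees with the corrected/known-good prefix on their common length:
    either it already begins with the whole prefix, or it is still an
    initial portion of it."""
    n = min(len(output), len(known_prefix))
    return output[:n] == known_prefix[:n]
```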
Nonetheless, many of the pages would work swimmingly up to a point, following which they would abruptly fail to generate any further valid PDF data. Not being able to figure out what was going on (but presuming, correctly, that it had something to do with the occasional major differences in the hex diffs I mentioned earlier), I put the project on hold until I had more free time.
PdfSharp LZW bug
It eventually transpired that some oddities in the recovered PDF data, including Inkscape being able to open it while PDF viewers could not, arose from a bug in the PdfSharp-derived LZW decoder: specifically, when it encountered an LZW code equal to the current insertion index for the LZ78 dictionary, it would output the string for the previous LZW code, then add the concatenation of that string with its first character to the dictionary. I had been unable to work out why this functionality existed, since an equivalent result could be obtained by just repeating the previous LZW code.
While reading other sources about LZW compression, I learned that, for an LZW code equal to the current dictionary insertion index, a decoder is supposed to output not merely the string for the previous LZW code, but the concatenation of that string with its first character: i.e. the string it is immediately poised to insert into the dictionary at the index referenced by that LZW code. Correcting this fixed the remaining oddities with the recovered PDF data.
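In Python terms, the corrected handling of that special case looks something like the following (a sketch of the core dictionary logic only; the clear and end-of-data codes and the changing code widths of the PDF /LZWDecode filter are omitted):

```python
def lzw_decode(codes):
    """Decode a sequence of LZW code values, handling the case where a code
    refers to the dictionary entry that is just about to be created."""
    table = [bytes([i]) for i in range(256)]
    out = bytearray()
    prev = None
    for code in codes:
        if code < len(table):
            entry = table[code]
        elif code == len(table) and prev is not None:
            # The code names the string we are poised to insert: the previous
            # string plus its own first byte. (The buggy decoder emitted only
            # the previous string here.)
            entry = prev + prev[:1]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out += entry
        if prev is not None:
            # Now actually insert that string into the dictionary.
            table.append(prev + entry[:1])
        prev = entry
    return bytes(out)
```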
The bug had thankfully not affected reconstruction, since the validation rules had been drawn up based on the data decompressed from the valid PDFs using the same, buggy LZW decoding logic.
Conclusions
I would appreciate never having to do this again.
Bizarre but relatively less destructive data corruption can be reversed if people care enough about that specific case to invest significant time into doing so.
Presumably, people don't use PdfSharp with LZW as opposed to Flate PDFs often enough for the bug to have been noticed.
The following data was recovered:
- Recovered LZW for page 1 (cutting off before the raster part)
- Recovered LZW for page 3 (likewise)
- Recovered LZW for page 4 (likewise)
- Decompressed recovered data for page 1 (likewise)
- Decompressed recovered data for page 3 (likewise)
- Decompressed recovered data for page 4 (likewise)
- Page 1 PDF (without glyph rasters)
- Page 3 PDF (likewise)
- Page 4 PDF (likewise)
The eventually recovered code page 354 (preview it) bears far closer structural resemblance to code page 353 than to code pages 355 through 360. Since code page 353 is named "BCDIC-A", it is possibly "BCDIC-B", and in any case part of the "BCDIC-*" subfamily rather than the "PTTC/BCD" subfamily.