File Extractor - Inner workings

Some of the content here comes from the comments I added to my code.


Some of the terms used in this page:

Term | Meaning | Remarks
SPC | The Vision Factory, SPC-Vision | The original creators of The Apprentice
BMP | BitMaP | Plain uncompressed image format
ICE | No idea | ICE-packed compressed data. Used originally in Atari
PCM | Pulse Code Modulation | Plain uncompressed audio format
ADPCM | Adaptive Differential Pulse Code Modulation | Compressed audio format
cADPCM | Compressed Adaptive Differential Pulse Code Modulation | Custom format for audio used by SPC: half the ADPCM information is added by code
CLUT | Color Look-Up Table | Remember those color-by-number paintings? That's how this image is stored.
Datablob | Plain byte array |
Sector | Section of the IMG file | 2352 bytes in most cases, 2048 in case of a data sector

Global setup

Arguments / input parameters
Name(s) of file(s) to extract


Read file "The_Apprentice.img" in the same folder where the program is.
Find and build table of contents.

Visualization of table of contents:
[Image: Table of Contents]
Cyan-colored values are directories. Blue-colored values are redirects to other directories.


Loop through all files in the list and process the selected ones.

Extract selected file

Arguments / input parameters
Full path of IMG file to extract selected file from
Selected file to extract
Root directory (contains list of all files) in case of requirement of secondary file


Set output path, get first sector number, file length, and filename.
Determine number of bytes per sector to use.

List of bytes per sector per file type:
File type | Bytes per sector | Remarks
.cda | 2352 | WAV audio (music)
.dat | 2048 | Data with error correction bytes
.rtf | 2352 | Real time files
If fewer than 2352 bytes per sector are used, skip the first 24 bytes of each sector.


Determine the number of sectors to read (even if the last one is only partially used).
Read the full contents of 'n' sectors at once as an unformatted datablob.
If required, get contents of secondary file too.
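
The sector arithmetic above can be sketched as follows. This is a minimal sketch, not the author's code: it assumes the IMG file stores every sector as 2352 raw bytes, and the names `read_file_blob`, `RAW_SECTOR_SIZE` and `BYTES_PER_SECTOR` are made up for illustration.

```python
import math

RAW_SECTOR_SIZE = 2352   # assumed size of one sector as stored in the IMG file
SKIPPED_BYTES = 24       # bytes skipped in front of a 2048-byte data sector

# Bytes per sector per file type, taken from the table above.
BYTES_PER_SECTOR = {".cda": 2352, ".dat": 2048, ".rtf": 2352}


def read_file_blob(image: bytes, first_sector: int, file_length: int, extension: str) -> bytes:
    """Return the usable bytes of every sector that the file touches."""
    bytes_per_sector = BYTES_PER_SECTOR[extension]
    # A partially used last sector is still read in full.
    nr_sectors = math.ceil(file_length / bytes_per_sector)
    skip = SKIPPED_BYTES if bytes_per_sector < RAW_SECTOR_SIZE else 0
    blob = bytearray()
    for sector_nr in range(first_sector, first_sector + nr_sectors):
        start = sector_nr * RAW_SECTOR_SIZE + skip
        blob += image[start:start + bytes_per_sector]
    return bytes(blob)
```

The result always has a length that is a multiple of the bytes-per-sector value, so the caller still has to trim it to the real file length where that matters.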

List of files that require a secondary file:
Filename | Secondary filename | Purpose
treas1.dat | level1.dat | Color table for digits
invaders.dat | level1.dat | Color table for all sprites
level2.dat | level1.dat | Angel death sprites
level3.dat | level1.dat | Angel death sprites
level4.dat | level1.dat | Angel death sprites
level5.dat | level1.dat | Angel death sprites
Angel death sprites are oddly specific eh?


Extract all assets from selected file.
Loop through assets and save them to output path.
Create 'extraction complete' marker file.

Extract assets

Arguments / input parameters
Full path of IMG file to extract selected file from
Datablob (byte array) to process
Filename
Datablob (byte array) of secondary file


There is one extract assets function per file. Only a few could be combined, because those files share the same setup.

Function used per selected file:
Function name | List of applicable files
CDA | track2.cda, track3.cda, track4.cda, track5.cda, track6.cda, track7.cda, track8.cda, track9.cda, track10.cda, track11.cda, track12.cda, track13.cda, track14.cda, track15.cda, track16.cda, track17.cda, track18.cda, track19.cda, track20.cda, track21.cda, track22.cda, track23.cda
Con_gfx | con_gfx.dat
Cr_gfx | cr_gfx.dat
Go_gfx | go_gfx.dat
Hi_gfx | hi_gfx.dat
Int_gfx | int_gfx.dat
Intro1 | intro1.dat
Intro2 | intro2.dat
Intro3 | intro3.dat
Intro4 | intro4.dat
Intro5 | intro5.dat
Intro6 | intro6.dat
Invaders | invaders.dat
L1_em | l1_em.dat
L2_em | l2_em.dat
L3_em | l3_em.dat
L4_em | l4_em.dat
L5_em | l5_em.dat
L6_em | l6_em.dat
Level1 | level1.dat
Level2 | level2.dat
Level3 | level3.dat
Level4 | level4.dat
Level5 | level5.dat
Level6 | level6.dat
Levelb | levelb.dat
Map | map1_1.dat, map2_1.dat, map3_1.dat, map4_1.dat, map5_1.dat, map6_1.dat
Tit_gfx | tit_gfx.dat
Treas | treas1.dat
Unmentioned files are not extracted because they contain no unique assets.


Although the list is large, there are basically only a few types of extraction functions:
- Extract contents of .dat file
- Extract contents of .cda file
- Extract contents of map[n]_1.dat file (Values of n: 1,2,3,4,5,6)
The exact details of each function differ, but their general setup is quite similar.
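
Picking one of these three extraction types from a filename could look something like this. It is only a sketch: the real program has one function per file, and `extraction_type` and `EXTRACTION_TYPES` are hypothetical names. Note that the generic .dat pattern also matches files the program deliberately leaves alone, so a real dispatcher would still need the list above.

```python
import re
from typing import Optional

# Checked in order, so the specific map pattern comes before the generic .dat one.
EXTRACTION_TYPES = [
    (re.compile(r"track\d+\.cda"), "cda"),
    (re.compile(r"map[1-6]_1\.dat"), "map"),
    (re.compile(r".+\.dat"), "dat"),
]


def extraction_type(filename: str) -> Optional[str]:
    """Return 'cda', 'map' or 'dat', or None when no type applies."""
    for pattern, kind in EXTRACTION_TYPES:
        if pattern.fullmatch(filename.lower()):
            return kind
    return None
```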

Extract assets: .dat

Arguments / input parameters
Filename
Datablob (byte array) to process
Secondary datablob (byte array) to process (optional)


The Vision Factory gave many of their .dat files their own table of contents.
It only contains the number of blocks and the size of each block.
To get the actual start of each block, each size has to be rounded up to a multiple of 2048 and added to the start of the previous block; the first block starts at byte 2048.
The offset of 2048 is present because the table of contents takes up at least 1 full sector.

Example:
Raw data (hex): 00 03 00 00 CB 6E 00 02 40 50 00 01 32 1F 00 00 00 00 .....
Nr blocks: 00 03 → 3

Block 0 start byte: 2048
Block 0 size: 00 00 CB 6E → 52078

2048 - (52078 modulo 2048) = 2048 - 878 = 1170
Block 1 start byte: 2048 + 52078 + 1170 = 55296
Block 1 size: 00 02 40 50 → 147536

2048 - (147536 modulo 2048) = 2048 - 80 = 1968
Block 2 start byte: 55296 + 147536 + 1968 = 204800
Block 2 size: 00 01 32 1F → 78367
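
The same calculation as a minimal Python sketch. The big-endian byte order follows from the example; two things are assumptions: that the table of contents fits in a single 2048-byte sector, and that a block whose size is already an exact multiple of 2048 gets no extra padding. `parse_block_table` is a made-up name.

```python
import struct

SECTOR_SIZE = 2048


def parse_block_table(blob: bytes):
    """Return a list of (start byte, size) tuples, one per block."""
    (nr_blocks,) = struct.unpack_from(">H", blob, 0)        # 2 bytes: number of blocks
    sizes = struct.unpack_from(f">{nr_blocks}I", blob, 2)   # 4 bytes per block size
    blocks = []
    start = SECTOR_SIZE  # the table of contents occupies the first sector
    for size in sizes:
        blocks.append((start, size))
        # Round the end of this block up to the next multiple of 2048.
        padding = (SECTOR_SIZE - size % SECTOR_SIZE) % SECTOR_SIZE
        start += size + padding
    return blocks
```

Fed with the raw data from the example, this gives the start bytes 2048, 55296 and 204800.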


If required, do the same for the secondary file.
Now that the file has been divided into blocks, each block can be processed separately.
This is when the manual labor starts. Time to divide the block into separate datablobs, each containing one of the following:
- Color table data
- Single ICE packed CLUT image
- Single CLUT image
- Set of sprites
- Set of cADPCM data

Separation guidelines:
Data type | Guidelines
Color table | Most color tables are 128 colors and, with 3 bytes per color (RGB), are 384 bytes in size. Multiple tables can be present. Indexed tables are 4 bytes per color (index, RGB) and usually preceded by a bank indexer: [C3 00 00 01] / [C3 00 00 02] / [C3 00 00 03]
ICE packed data | One of the few formats with a definitive header: "ICE!" (49 43 45 21), followed by the size of the blob (4 bytes) and the size of the output (also 4 bytes). Only found this format in "int_gfx.dat".
CLUT image | Almost impossible to recognize without the use of human eyes, a 'view from afar' and the right bytes-per-row. Something like this: [Image: CLUT stored image in a hex editor] It's a labor-intensive method, but apart from creating a fully working custom-designed undocumented file interpreter that may or may not even work for ALL files... I stuck to this method. It's also required to determine the width of each CLUT image. It's not stored anywhere I looked.
Sprites | Compiled sprites are usually preceded by a list of offsets (or indices), often ended by an RTS op-code (4E 75). Each index points to one sprite, so count the indices and the RTS commands. Or eyeball it.
cADPCM | cADPCM data is 120 bytes 'wide' when viewed from afar and is recognizable by its 'band' of different looking data. It's preceded by a small header detailing the number of sounds, and the length of each. Like so: [Image: cADPCM as viewed from afar]
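
Two of these guidelines are mechanical enough to sketch in code: splitting a plain color table and recognizing an ICE header. This is a rough sketch; the big-endian byte order of the two ICE size fields is an assumption, and `split_color_table` and `read_ice_header` are hypothetical names.

```python
import struct


def split_color_table(data: bytes, nr_colors: int = 128):
    """Turn a plain color table (3 bytes per color) into (red, green, blue) tuples."""
    return [tuple(data[i * 3:i * 3 + 3]) for i in range(nr_colors)]


def read_ice_header(data: bytes):
    """Return (packed size, unpacked size), or None when the data is not ICE packed."""
    if data[:4] != b"ICE!":  # 49 43 45 21
        return None
    # Assumed big-endian: size of the packed blob, then size of the output.
    packed_size, unpacked_size = struct.unpack_from(">II", data, 4)
    return packed_size, unpacked_size
```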


Once all the data is grouped, it's time to:
- Decompile the sprites to CLUT images
- Unpack the ICE images to CLUT
- Animate CLUT images with an active palette
- Decode CLUT images to BMP
- Decompress cADPCM to ADPCM
- Decode ADPCM to PCM
- Write the assets data (WAV, BMP) to files

Decompile sprites process:
It's basically a partial implementation of the CD-i's microprocessor (a Motorola 68000 family chip) that runs the compiled sprites code.
The machine language is extremely low level, and I will not explain it here.

Compiled sprites are stored in three ways, often combinations of two or more:
- Pure instructions (directly set pixel numbers)
- Memory instructions (pixel numbers are stored somewhere in the program), i.e. pixel shifted
- Register instructions (often-used pixel numbers are stored in program-registers)

Most sprites end with a certain command (RTS operation code) and I keep a record of when these occur. The record then helps me split the sprites once all sprites are decompiled.
By measuring the contents of the resulting sprite-block (384px wide by actual height) I can determine the width.

Unpack ICE process:


This is a compressed data format. It unpacks from the end of the file to the beginning (rather unusual I think).
The following is what I think it does (as I re-wrote the C program to C#).
Based on instructions, it will either:
- Copy an 'x' number of bytes from the source to the destination (from packed to unpacked).
- Repeat an 'x' number of previously unpacked bytes 'y' times.

Animate CLUT with active palette


Prior to decoding the CLUT image, an adjustment to the color table will 'animate' it.
Which color numbers need to be animated, and with which colors, is unknown (that is, I can't find the exact code that does this).
Below is an example of how I did this:
[Image: Tower 3 conveyor belt] The green cog has 3 frames

Not all animation is obvious:
[Image: Tower 1 window]

Decode CLUT process:


The decoded data will get a bitmap header so it can be saved to file immediately.
For that a bit of math is required:

PixelArrayOffset = 14 + 40                                        (14: BMP header size, 40: DIB header size)
NrBytesPerLine   = ((4 - ((Width * 3) % 4)) % 4) + (Width * 3)    (3 bytes per pixel, padded to a multiple of 4 bytes)
NrBytes          = NrBytesPerLine * Height
FileSize         = PixelArrayOffset + NrBytes
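
As a quick check of this math, here is the same calculation in Python (`bmp_sizes` is just an illustrative name):

```python
def bmp_sizes(width: int, height: int):
    """Return (PixelArrayOffset, NrBytesPerLine, NrBytes, FileSize) for a 24-bit BMP."""
    pixel_array_offset = 14 + 40             # BMP header + DIB header
    row = width * 3                          # 3 bytes per pixel
    nr_bytes_per_line = row + (4 - row % 4) % 4   # pad to a multiple of 4 bytes
    nr_bytes = nr_bytes_per_line * height
    return pixel_array_offset, nr_bytes_per_line, nr_bytes, pixel_array_offset + nr_bytes
```

A 5 pixel wide, 2 pixel high image needs 15 bytes per line, which is padded to 16, so the pixel data is 32 bytes and the file is 86 bytes. A 384 pixel wide line is 1152 bytes and needs no padding.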

BMP file header
Offset (bytes) | Size (bytes) | Contents | Description
00 | 2 | "BM" | The header field used to identify the BMP and DIB file: BM = Windows 3.1x, 95, NT, ... etc.; BA = OS/2 struct bitmap array; CI = OS/2 struct color icon; CP = OS/2 const color pointer; IC = OS/2 struct icon; PT = OS/2 pointer
02 | 4 | FileSize | The size of the BMP file in bytes
06 | 2 | 0 | Reserved; actual value depends on the application that creates the image
08 | 2 | 0 | Reserved; actual value depends on the application that creates the image
10 | 4 | PixelArrayOffset | The offset, i.e. starting address, of the byte where the bitmap image data (pixel array) can be found.
Source: Wikipedia BMP file format - BMP file header

DIB file header (directly follows the BMP header)
Offset (bytes) | Size (bytes) | Contents | Description
14 | 4 | 40 | The size of this header
18 | 4 | Width | The bitmap width in pixels
22 | 4 | Height | The bitmap height in pixels
26 | 2 | 1 | The number of color planes
28 | 2 | 24 | The number of bits per pixel
30 | 4 | 0 (none) | The compression method being used
34 | 4 | NrBytes | The size of the raw bitmap data in bytes
38 | 4 | 2835 | The horizontal resolution of the image in pixels per meter
42 | 4 | 2835 | The vertical resolution of the image in pixels per meter
46 | 4 | 0 | The number of colors in the color palette, or 0 to default to 2^n
50 | 4 | 0 | The number of important colors used, or 0 when every color is important; generally ignored
Source: Wikipedia BMP file format - Windows BITMAPINFOHEADER
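
Both headers together are 54 bytes and can be packed in one go. Below is a minimal sketch using Python's `struct` module; all BMP fields are little-endian, and `build_bmp_header` is a made-up name rather than a function from the extractor.

```python
import struct


def build_bmp_header(width: int, height: int) -> bytes:
    """Return the 14-byte BMP header followed by the 40-byte DIB header."""
    nr_bytes_per_line = (width * 3 + 3) // 4 * 4   # 3 bytes per pixel, padded to 4
    nr_bytes = nr_bytes_per_line * height
    pixel_array_offset = 14 + 40
    file_header = struct.pack(
        "<2sIHHI",
        b"BM", pixel_array_offset + nr_bytes, 0, 0, pixel_array_offset)
    dib_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height,   # header size, width, height
        1, 24,               # color planes, bits per pixel
        0, nr_bytes,         # compression (none), raw data size
        2835, 2835,          # horizontal / vertical pixels per meter
        0, 0)                # palette colors, important colors
    return file_header + dib_header
```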


Now that the header is created, it's time to write the data!
The lines are stored bottom-to-top, the pixels are left-to-right, and each pixel is stored BGR (blue,green,red).
In pseudo-code that would look like this:

for (LineNr = 0 to Height - 1)
{
  for (PixelNr = 0 to Width - 1)
  {
    //get pixel data
    ColorNumber = ImageData[(Height - LineNr - 1) * Width + PixelNr]

    //get color
    //compiled sprites use 255 for transparency
    if (IsCompiledSprite and (ColorNumber == 255)) { Color = {blue=50, green=255, red=18} } //green screen
    else                                           { Color = ColorTable[ColorNumber] }      //regular color

    //apply color
    PixelStart = PixelArrayOffset + (LineNr * NrBytesPerLine) + (PixelNr * 3)
    BitmapData[PixelStart + 0] = Color[ blue]
    BitmapData[PixelStart + 1] = Color[green]
    BitmapData[PixelStart + 2] = Color[  red]
  }
}
Height: Height of image
Width: Width of image
ImageData: CLUT stored image data
ColorTable: The RGB values per color number
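
For reference, a runnable Python version of the pseudo-code above. It is a sketch under a few assumptions: the color table holds (red, green, blue) tuples, the function returns only the pixel array (so the header offset is left out of the index), and `clut_to_pixel_array` is a hypothetical name.

```python
def clut_to_pixel_array(image_data, width, height, color_table, is_compiled_sprite=False):
    """Convert CLUT image data to a bottom-to-top, BGR, row-padded BMP pixel array."""
    green_screen = (18, 255, 50)                   # red, green, blue
    nr_bytes_per_line = (width * 3 + 3) // 4 * 4   # rows are padded to 4 bytes
    pixels = bytearray(nr_bytes_per_line * height)
    for line_nr in range(height):
        # BMP line 0 is the bottom line of the image.
        source_row = (height - line_nr - 1) * width
        for pixel_nr in range(width):
            color_number = image_data[source_row + pixel_nr]
            # Compiled sprites use 255 for transparency.
            if is_compiled_sprite and color_number == 255:
                red, green, blue = green_screen
            else:
                red, green, blue = color_table[color_number]
            start = line_nr * nr_bytes_per_line + pixel_nr * 3
            pixels[start:start + 3] = bytes((blue, green, red))
    return bytes(pixels)
```

Writing the 54-byte header followed by this pixel array to disk gives a complete BMP file.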

Decompress cADPCM process:

The cADPCM data has a header that contains:

Offset (bytes) | Size (bytes) | Contents
00 | 4 | Number of sounds (actual number is 1 less)
04 | 4 | Coding data

Coding data:
Bit-mask (hex) | Values (hex) | Usual value | Description
FF00 | 00-FF | 0 | Empty
0080 | 00-01 | 0 | Zero
0040 | 00-01 | 0 | Emphasis (0:off, 1:on)
0030 | 00-11 | 0 | Bits/Sample (00:4bits, 01:8bits, 10/11:reserved)
000C | 00-11 | 1 | Sampling Frequency (00:37.8kHz, 01:18.9kHz, 10/11:reserved)
0003 | 00-11 | 1 | Mono/Stereo (00:mono, 01:stereo, 10/11:reserved)

For each sound the header contains:
Offset (bytes) | Size (bytes) | Contents
SoundNr*08 + 00 | 4 | Position (also add header offset)
SoundNr*08 + 04 | 2 | Number of soundgroups in sound
SoundNr*08 + 06 | 1 | Channel of the sound (0: left, 1: right)
SoundNr*08 + 07 | 1 | Zero (empty)
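
The bit-masks and the per-sound entry translate fairly directly into code. This is a minimal sketch, not the extractor's actual code: the big-endian byte order is an assumption, and `decode_coding_data` and `read_sound_entry` are made-up names. Reserved values decode to None.

```python
import struct


def decode_coding_data(coding: int) -> dict:
    """Decode the coding data bits using the bit-masks from the table above."""
    return {
        "emphasis": bool(coding & 0x0040),
        "bits_per_sample": {0: 4, 1: 8}.get((coding & 0x0030) >> 4),
        "sample_rate_hz": {0: 37800, 1: 18900}.get((coding & 0x000C) >> 2),
        "channels": {0: 1, 1: 2}.get(coding & 0x0003),
    }


def read_sound_entry(header: bytes, sound_nr: int) -> dict:
    """Read the 8-byte entry of one sound at offset SoundNr*08 (assumed big-endian)."""
    position, nr_soundgroups, channel, _zero = struct.unpack_from(
        ">IHBB", header, sound_nr * 8)
    return {"position": position, "nr_soundgroups": nr_soundgroups, "channel": channel}
```

With the usual values from the table (bits/sample 0, sampling frequency 1, mono/stereo 1) the coding data is 0x05, which decodes to 4 bits per sample, 18.9kHz and stereo.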

[Image: cADPCM header format]
Although two sounds start at [00 00 01 08], they are in different channels and can thus be extracted separately!


With the cADPCM header processed, the header and data of the sounds are next.
Each sound is made up of soundgroups. Each soundgroup consists of 8 parameter bytes and 112 databytes.
The parameter data of one channel has to be copied to the other channel to create stereo, then duplicated to match ADPCM format.
The data bytes of one channel only need to be copied to the other channel.
This process creates stereo sound, but the content of the channels is equal.
Two different sounds can be on the same position, just in different channels (here's where masking comes in).
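
Cutting a sound into its soundgroups is the easy part and could be sketched like this. It assumes the 8 parameter bytes come first in each 120-byte soundgroup, and `split_soundgroups` is a hypothetical name. Expanding the result to the ADPCM layout is not shown here.

```python
PARAMETER_BYTES = 8
DATA_BYTES = 112
SOUNDGROUP_SIZE = PARAMETER_BYTES + DATA_BYTES   # 120 bytes


def split_soundgroups(sound: bytes, nr_soundgroups: int):
    """Return a list of (parameter bytes, data bytes) tuples, one per soundgroup."""
    groups = []
    for group_nr in range(nr_soundgroups):
        start = group_nr * SOUNDGROUP_SIZE
        group = sound[start:start + SOUNDGROUP_SIZE]
        # Assumed layout: parameters first, then the data bytes.
        groups.append((group[:PARAMETER_BYTES], group[PARAMETER_BYTES:]))
    return groups
```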

Decode ADPCM process:
The details can be found here: CD-i forum.
Suffice to say that it's well documented there and I'm not about to re-write it here.


Due to the behavior of my program, there are additional steps for the compiled sprites:
- Separate decompiled sprite data into one sprite per datablob (I made a function that does this)
- Determine which sprites to ignore
- Determine which sprites should be concatenated/combined into a single image (must be sequential or ignored)
- Determine color table per sprite in case of multiple color tables
- Resize sprite to make animation easier

And let's not forget the special cases:
- Crumbling walls (single sprite needs to become multiple sprites)
- Locked doors (non-sequential, requires transparency-supported overlay)
- Split up sprites (non-sequential, but separated by required sprites)

Ignored sprite example:
[Image: Ignored sprite]

Concatenation example:
[Image: Concatenation]
[Image: Sequential concatenation]

Resizing example:
[Image: Resizing sprites]
Original [Image: Comparison between original and resized animation] Resized

Crumbling walls process:
[Image: Crumbling wall]
Combination of ignoring sprites, concatenation, and resizing.

Locked doors process:
[Image: Locked doors]
Combination of ignoring sprites, concatenation, and transparency overlay.

Extract assets: .cda

Arguments / input parameters
Filename
Datablob (byte array) to process
CodingData (only used when converting ADPCM to PCM)


This function extracts a single PCM audio file (plain WAV) from the datablob.
Therefore, a valid WAV header has to be created.
If the source was originally ADPCM or compressed ADPCM (cADPCM), there is additional information (namely coding data).
In that case, the sample rate and number of channels (mono or stereo) are determined by this data.
Otherwise it's 44.1kHz and stereo.

WAV header construction information:

How-to: WAV header
Original source: here
Field name | Contents
ChunkID | "RIFF"
ChunkSize | 38 + subChunk2Size
Format | "WAVE"
Subchunk1ID | "fmt "
Subchunk1Size | 18
AudioFormat | 1 (PCM)
NumChannels | nrChannels
SampleRate | sampleRate
ByteRate | sampleRate * nrChannels * bitsPerSample / 8
BlockAlign | nrChannels * bitsPerSample / 8
BitsPerSample | 16
ExtraParamSize | 0 (unused)
Subchunk2ID | "data"
Subchunk2Size | nr bytes in datablob
data | Datablob
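
The table above maps onto a 46-byte header. Here is a minimal sketch with Python's `struct` module, using the standard little-endian RIFF field sizes; `build_wav` is an illustrative name, and the defaults are the 44.1kHz stereo case used when there is no coding data.

```python
import struct


def build_wav(datablob: bytes, sample_rate: int = 44100, nr_channels: int = 2) -> bytes:
    """Prefix 16-bit PCM data with the WAV header from the table above."""
    bits_per_sample = 16
    block_align = nr_channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    header = struct.pack(
        "<4sI4s4sIHHIIHHH4sI",
        b"RIFF", 38 + len(datablob), b"WAVE",
        b"fmt ", 18,               # Subchunk1Size includes the 2-byte ExtraParamSize
        1, nr_channels,            # AudioFormat 1 = PCM
        sample_rate, byte_rate, block_align, bits_per_sample,
        0,                         # ExtraParamSize (unused)
        b"data", len(datablob))
    return header + datablob
```

The 38 in ChunkSize is the number of header bytes that follow the ChunkSize field itself: 4 for "WAVE", 8 + 18 for the fmt chunk and 8 for the data chunk header.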