Introduction
This page offers some basic suggestions and strategies for people who
are starting out with reverse engineering old adventure games and
aren't sure how to begin. It mainly focuses on resources and tools for
reversing DOS game executables, but many of the strategies discussed
apply equally to other systems and debugging tools. It is only intended
as an overview; you'll still need to read other resources to learn 8086
assembly language, and to learn how to use the various tools
effectively.
Resources
IDA Disassembler
IDA is one of the best disassemblers available, and luckily the
freeware version works with old DOS executables. Better still, the
current freeware version supports viewing disassemblies in graph mode,
making it easier to see the overall flow of individual methods.
DosBox Debugger
The DosBox Debugger is an invaluable tool for running old DOS games
while monitoring how the program executes and what values are generated
by the executing code.
XVI32 Hex File Viewer
Although IDA has a built in hex viewer for the executable itself, the
XVI32 tool is useful for viewing the contents of all the other files
that come with a game. There are many different freeware hex editors
available, so any other can be used just as easily.
Ralf Brown's Interrupt List
A nice reference for the operation of DOS interrupts. In 8086 assembly,
apart from directly accessing ports, using interrupts is the primary
means of accessing system functionality such as opening files, changing
graphics modes, and many other things.
8086 Assembly Language
For those new to 8086 assembly language, you'll need a handy reference
to learn the syntax. Wikipedia is a good starting point, but you can
also simply Google for an introduction.
Using the DosBox Debugger
Whether to use a debugger when reverse engineering a program is up to
the individual. Some prefer the more cerebral challenge of working out
code execution purely from the disassembly, whereas others find a
debugger useful for seeing what values are passed to functions. I would
recommend using a debugger particularly when reversing a game for the
purpose of adding ScummVM support. Once you've got portions of the game
disassembled and start writing code to implement game functionality, it
can be immensely useful for tracking down bugs, particularly if you
initially write your code with names that closely match the names you
gave the methods in the disassembly.
For debugging purposes, if the game is a DOS game, the DosBox
Debugger is the best tool I've found for executing and debugging DOS
programs. The default distribution of DosBox doesn't have it enabled,
but you can either compile DosBox with it enabled, or download a
previously compiled executable. See the
DosBox Debugger Thread for more information.
One of the biggest initial steps when using the DosBox Debugger
is matching addresses in the executable at run-time with your
disassembly in IDA. This can be done either from the debugger to IDA, or
from IDA to the debugger:
From the debugger to IDA
This is the easier direction. If you break execution of the game at any point,
you can simply use the Find Binary option in IDA to search for a
sequence of bytes from the instructions shown in the DosBox Debugger
disassembly area. Be careful to pick instructions that aren't far calls
or jumps - such instructions are modified when a program loads depending
where it loads in memory, so the IDA disassembly won't have the exact
same bytes. If you do find a match, double check that the offset within
the segment of the found match in IDA matches the offset of the
instructions in the DosBox Debugger. If not, you may have found a false
match, and should either search for the next occurrence, or specify
extra bytes in your search until you find the correct match.
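The same kind of search can be tried outside IDA. The following is only a minimal sketch, assuming the executable has been read into a bytes object; the function name find_pattern and the sample bytes are made up for illustration and don't come from any real game:

```python
def find_pattern(image, pattern):
    """Return every offset at which pattern occurs in image."""
    hits = []
    pos = image.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = image.find(pattern, pos + 1)
    return hits


# Made-up stand-in for an executable image. The bytes 55 89 E5
# (push bp / mov bp, sp) appear twice, as a common prologue would.
image = bytes.fromhex("90905589e5b80100c35589e5b80200c3")

short = bytes.fromhex("5589e5")         # ambiguous: more than one match
longer = bytes.fromhex("5589e5b80200")  # extra bytes narrow it down

print([hex(o) for o in find_pattern(image, short)])
print([hex(o) for o in find_pattern(image, longer)])
```

The short pattern matches in two places, while the longer one matches only once, which is the "specify extra bytes" step described above. As with the IDA search, avoid including the bytes of far calls or jumps in the pattern.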
From IDA to the Debugger
If you have a point in the IDA disassembly and want to figure out
at what address it will be loaded in the DosBox Debugger, that's not
hard either. This presumes the game in question doesn't use overlays.
Overlays were a method developed when games and other applications
became too big to fit into memory at once. In these cases, code for the
program is often stored at the end of the executable, or in separate
files, and loaded as needed into part of the memory, overwriting
previously loaded code. In such situations, it becomes hard to pin down a
specific section in memory a given segment will be loaded at, since it
may shift around in memory over the course of the program running, as it
gets loaded, overwritten, and loaded again repeatedly.
So long as the game doesn't use overlays, the following steps can be used:
- Look at the bottom of the IDA view to find the current file offset.
You'll quickly spot it if you try selecting different instructions,
since it will keep changing.
- Get the file offset of the beginning of the current code segment.
This is just to make the calculations easier, since the start of the
segment will have an instruction offset between 0h and 0Fh, which means
it won't interfere with the segment calculations.
- Get the file offset of the beginning of the entire disassembly.
- Drop the last hexadecimal digit from both values, and get the difference between the two.
- For executables run in DosBox, add a value of '0138h'. For COM files, add a value of '0128h'.
This will give you the segment address of where the segment
should be under DosBox. It's generally a good idea to then rename the
current segment in the IDA disassembly so that it includes the actual
segment address of where it was loaded in DosBox. For example, the first
segment of executables is normally loaded at segment 0138h in memory, so
you might rename the segment 'sg0138'. That way, if you later want to
set a breakpoint in the DosBox Debugger for any instruction in the
segment, you will immediately know what the segment is.
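The calculation above can be sketched as a small helper. This is only an illustration: the function names and the example file offsets are made up, and the 0138h/0128h constants are simply the values quoted above, which may differ on other DosBox setups:

```python
EXE_LOAD_SEGMENT = 0x0138  # value quoted above for EXE files under DosBox
COM_LOAD_SEGMENT = 0x0128  # value quoted above for COM files


def dosbox_segment(segment_start, disassembly_start, is_com=False):
    """Estimate the run-time segment of an IDA segment (no overlays)."""
    # Dropping the last hex digit turns a byte offset into 16-byte paragraphs.
    paragraphs = (segment_start >> 4) - (disassembly_start >> 4)
    base = COM_LOAD_SEGMENT if is_com else EXE_LOAD_SEGMENT
    return paragraphs + base


def suggested_name(segment):
    """Build a segment name such as 'sg0138' for renaming in IDA."""
    return "sg%04X" % segment


# Hypothetical file offsets: the disassembly starts at 200h and the
# segment of interest starts at 1A30h.
seg = dosbox_segment(0x1A30, 0x200)
print(suggested_name(seg))
```

For the very first segment both offsets are the same, so the difference is zero and the result is just the load segment itself.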
Using IDA Effectively
One of the best things to do when disassembling a game is to document everything. Particularly method parameters and structures.
Naming Methods
Methods can be renamed using the general 'N' hotkey (as well as via
the menus), and the 'Y' hotkey can be used to specify a C-like prototype for a
method. This is particularly useful when some of the parameters for a
method are passed using registers. By explicitly documenting what the
method expects, it makes it easier to remember later on when you're
reversing methods that call it. Standard methods where parameters are
passed via the stack are easy, since IDA can automatically set up the
function prototype for you. If a method does have parameters passed in
registers, prototypes like the below can be used:
int __usercall sub_100FB<ax>(__int8 param1<al>, int param2<bx>)
In this case, the method takes an 8-bit parameter in the al register,
and another 16-bit value in bx, then returns a result in ax.
Using Structures
The other thing you'll need to learn to use IDA effectively is the
use of structures. Irrespective of what language a game was originally
written in, there will always be structures containing related
information. It may be something as simple as a C-style struct, or could
even be the fields of a class in C++.
When dealing with data, you'll frequently see cases like
mov bx, 30h
mul bx
mov bx, ax
mov ax, [bx+2D00h]
In this case, an initial index in the ax register is multiplied by
30h (30 hexadecimal = 48 decimal). So from this we can determine that
the given structure is 48 bytes in size, and can create a new structure
accordingly. For smaller sized structures, you may want to create as
many 2 byte word fields as needed to make up the correct size for the
structure. For larger sizes, the easiest way is to simply declare an
array of the needed structure size - 1, and follow it with a single byte
field. You can then delete/undefine the array. The remaining byte will
keep the structure at the correct size, and you can then later fill in
the fields as you find references to them.
Secondarily, the value of '2D00h' indicates an offset in the data
segment, representing the rough starting address of the first element
of the given array in memory. Here we run into a minor problem. The
offset of '2D00h' may not indicate the precise start of the array. If
the code in question wanted to get the value at offset 8 in the
structure, then the array may actually start 8 bytes earlier in memory,
at address '2CF8h'.
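To make the arithmetic concrete, here is a small sketch of how such an access maps onto an array start, an element index, and a field offset. The function name is made up; only the 30h element size and the 2D00h/2CF8h values come from the example above:

```python
ELEMENT_SIZE = 0x30  # the multiplier seen in the code fragment


def field_address(array_start, index, field_offset):
    """Data segment offset of one field within one array element."""
    return array_start + index * ELEMENT_SIZE + field_offset


# The same displacement of 2D00h fits two different interpretations:
# an array at 2D00h accessed at field offset 0, or an array at 2CF8h
# accessed at field offset 8.
print(hex(field_address(0x2D00, 0, 0)))
print(hex(field_address(0x2CF8, 0, 8)))
```

Both calls give the same address for element 0, which is why the displacement alone can't tell you where the array really begins.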
In such cases, the only way to tell for sure is to search the
program for immediate values one byte lower at a time, until you can't
find any more. For example, if you find references in the code to the
values '2CFFh', '2CFEh', and '2CFDh', each preceded by a multiplication
by 30h/48, but none to '2CFCh' or '2CFBh', then you can probably be
confident that the array starts at offset 2CFDh.
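The backwards walk can be sketched as follows. This assumes you have already collected, by hand or otherwise, the set of displacements that appear in the code after a multiplication by 30h; the set below is hypothetical and the function name is made up:

```python
def guess_array_start(seen, referenced):
    """Walk down from a known displacement while lower ones are referenced."""
    start = seen
    while (start - 1) in referenced:
        start -= 1
    return start


# Hypothetical displacements found next to a multiply by 30h.
referenced = {0x2D00, 0x2CFF, 0x2CFE, 0x2CFD}
print(hex(guess_array_start(0x2D00, referenced)))
```

This is only a heuristic: if the first few bytes of the structure are never accessed directly, or are only accessed as part of a wider field, the walk will stop too early.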
Once that's determined, you can then create a dummy structure of
the correct size, and convert the given address of 2CFDh to an instance
of that structure type. Until you're more familiar with the range of
values the original array index may take, it'll likely be easier to
simply leave the defined array with a single element. You can always
change the array later on to specify how many elements it has.
Remember that fields in structures can vary in size, so it's
always possible you'll get the starting address wrong. In that case,
you may later have to correct the address of the structure in the data
segment. This will affect any fields you've figured out as well. In the
above example, if you mistakenly presumed the array started at offset
2CFDh, then 2D00h would be thought to be a field at offset 3 in the
structure (2CFDh + 3 = 2D00h, as per the above example code fragment).
However, if the array really starts at 2CF8h, then the same field should
be at offset 8 within the structure (2CF8h + 8 = 2D00h). You would then
need to rebuild the list of fields you'd figured out in the structure,
since they'll all be at the wrong position. Overall, when you encounter
an array it's better to spend the extra time making sure where it starts
in memory, so you don't need to fix offset problems later on.
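If you do end up moving the start of an array, the bookkeeping is simple enough to sketch. The function and the field name here are made up for illustration; the offsets are the ones from the example above:

```python
def rebase_fields(fields, old_start, new_start):
    """Shift known field offsets after the array start has been moved."""
    shift = old_start - new_start
    return {offset + shift: name for offset, name in fields.items()}


# A field first recorded at offset 3, when the array was thought to
# start at 2CFDh. The name is hypothetical.
fields = {3: "field_2D00"}
rebased = rebase_fields(fields, 0x2CFD, 0x2CF8)
print(rebased)
```

Every field moves by the same amount, here 5 bytes, so the field previously at offset 3 ends up at offset 8.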
Disassembly Strategies
One of the hardest things when starting work on a new disassembly is
to figure out how to begin. The following are offered as suggestions of
how to get started in the disassembly process.
File Access
One of the easiest places to start a disassembly is generally by
identifying file accesses. Using IDA, you can, for example, do a text
search for 'open', 'read', 'close', etc. to find occurrences of file
operations. IDA provides standard comments for many operating system
calls, so even in a new disassembly you should be able to locate such
calls by their comment text. Normally, a program will encapsulate these
calls into methods of its own, so your first disassembly step can be
identifying those methods and giving them appropriate names like
'File_open', 'File_read', and so on. Likewise, give the passed
parameters appropriate names. In IDA, the 'Y' command can be used to set
up an appropriate method signature. Properly naming the method and its
parameters will help you in all the methods that call it.
For example, if a read method has a 'size' parameter and a
'buffer' parameter, and a method that calls it passes '200' for the
size and a reference to a location on the stack, you can be confident
that the stack entry can be called something like 'readBuffer', and use
the '*' (array size) key when looking at the Stack View (Ctrl-K) to set
the size of the array to 200 bytes.
You should then be able to start working on the methods that call
the file access functions, and hopefully start decoding them.
Some examples:
1. If the game consists of only a few large data files, the
methods that call the open/read/close functions may be part of a
resource manager responsible for loading subsets of the file. In that
case, the methods will likely load some kind of index into memory and
then have a separate 'get resource' method that scans through the list
for a resource with a given Id, resulting in a specific portion of the
data file being read.
In this case, you can identify all the methods with appropriate names
like 'ResourceManager_init', 'ResourceManager_loadIndex',
'ResourceManager_getResource', and so on.
The DosBox debugger may prove useful when dealing with games
using large resources. In DOS, Interrupt 21h is one of the primary
system interrupts. Specific command Ids are passed in AH, and the other
registers are set with values depending on which function is being
called. For example, command 42h of Interrupt 21h is the command for
seeking within a file.
Try using 'BPINT 21 42' to put a breakpoint on any calls to the seek
system function. Having identified the 'Seek method' this way, you can then step
out of that routine to find what called it. Hopefully, you can then
examine the logic in the disassembly used to generate the file offset to
help you figure out how file offsets are generated for specific
resources, and from that figure out how the resource's index works.
Remember that game resource managers not only typically merge
multiple individual resources into one single bigger file, they
frequently also compress them as well, to save space and prevent people
from seeing textual resources when viewing the contents of the file. In
such cases, if you can figure out the strategy used for extracting
single resources, it may be worthwhile taking the time to code a
standalone program to extract and, if necessary, decompress single
resources into separate output files. That way, you can more easily look
at individual resources that are used by the game without having to
worry about manually locating them in the archive/resource file.
2. If the game consists of many different files, it's likely the
game will be manually calling the open/read/close methods whenever it
wants to access a particular file/resource.
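For the 'BPINT 21 42' breakpoint mentioned in the first case, it helps to know how to read the registers when the breakpoint triggers. For this DOS function, AL holds the seek origin, BX the file handle, and CX:DX the 32-bit offset. A minimal sketch of decoding them, with a made-up function name and made-up register values:

```python
SEEK_ORIGINS = {0: "start of file", 1: "current position", 2: "end of file"}


def decode_seek(ax, bx, cx, dx):
    """Interpret register values noted at an INT 21h, AH=42h breakpoint."""
    if (ax >> 8) != 0x42:
        raise ValueError("AH is not 42h, so this is not a seek call")
    offset = (cx << 16) | dx
    if offset & 0x80000000:  # CX:DX is treated as a signed 32-bit value
        offset -= 0x100000000
    return {
        "handle": bx,
        "origin": SEEK_ORIGINS.get(ax & 0xFF, "unknown"),
        "offset": offset,
    }


# Made-up register values as they might be read off the debugger.
print(decode_seek(0x4200, 5, 0x0001, 0x2340))
```

Noting the decoded offsets for a few different resources gives you concrete numbers to compare against whatever index data the game has loaded.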
In either case, figuring out the file access routines will give
you an excellent start into figuring out the contents of the game; you
can then move on to methods that call the resource manager's 'get
resource' method, start looking at what kinds of resources are loaded,
and from there start identifying methods that make use of those
resources.
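As a rough idea of what a standalone extraction tool for a resource file might look like, here is a sketch built around an entirely hypothetical index format: a 16-bit entry count followed by entries of a 16-bit Id, a 32-bit offset, and a 16-bit size. A real game will use its own layout, which you'd need to work out from the disassembly, and the resources may also need decompressing afterwards:

```python
import struct

# Hypothetical index entry: id (16-bit), offset (32-bit), size (16-bit).
ENTRY = struct.Struct("<HIH")


def load_index(data):
    """Parse the made-up index: a 16-bit count, then ENTRY records."""
    (count,) = struct.unpack_from("<H", data, 0)
    index = {}
    for i in range(count):
        res_id, offset, size = ENTRY.unpack_from(data, 2 + i * ENTRY.size)
        index[res_id] = (offset, size)
    return index


def get_resource(data, index, res_id):
    """Return the raw (possibly still compressed) bytes of one resource."""
    offset, size = index[res_id]
    return data[offset:offset + size]


# Build a tiny archive in memory to exercise the two functions.
header_size = 2 + 2 * ENTRY.size
archive = (struct.pack("<H", 2)
           + ENTRY.pack(10, header_size, 5)
           + ENTRY.pack(20, header_size + 5, 6)
           + b"HELLO" + b"WORLD!")
index = load_index(archive)
print(get_resource(archive, index, 20))
```

Using names that mirror the ones given in the disassembly, such as a 'loadIndex' and a 'getResource', makes it easier to compare the tool's behaviour against the original code.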
Graphics access
Another place to get started on the disassembly is the graphic draw
routines, those responsible for copying raw pixels to the screen
surface.
Graphic display was complicated in the early PC days by different
modes for the different graphics cards writing to memory in different
ways. In the Monochrome/Hercules mode, for example, 8 pixels are stored
per 8-bit byte. In EGA, the addressing can be complicated by how the
display is configured - the same areas of memory may be used to
represent different parts of pixels - with the part of a pixel being
updated depending on specific values sent to hardware ports. Finally,
the most common mode, 320x200 in 256 colours, is the easiest of them all
to deal with, since each pixel takes up a single byte.
For most of the graphics modes, you can look at them in a similar
manner - as a block of data in memory starting at address A000h:0. Only
the number of bytes per line will vary, depending on what the graphics
mode is. Assembly routines that deal with the graphics screen will
typically have code to figure out screen offsets based on provided x and
y parameters, so it will frequently be easy to identify the parameters
and figure out how the screen offsets work. For example, in 320x200x256
MCGA mode, an offset on the screen will be calculated using the formula
(y * 320) + x.
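The formula and its inverse are easy to sketch. The inverse is handy when a debugger shows you an offset into screen memory and you want to know which pixel it refers to. The function names here are made up:

```python
SCREEN_WIDTH = 320   # bytes per line in the 320x200x256 mode
SCREEN_HEIGHT = 200


def screen_offset(x, y):
    """Offset of pixel (x, y) from the start of screen memory."""
    return y * SCREEN_WIDTH + x


def screen_position(offset):
    """Inverse of screen_offset: the (x, y) a given offset refers to."""
    y, x = divmod(offset, SCREEN_WIDTH)
    return x, y


print(screen_offset(160, 100))
print(screen_position(32160))
```

When you see a multiplication by 320 (140h) in a routine, often implemented as a combination of shifts and adds, it's a strong hint that the value being multiplied is a y coordinate.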
For finding the graphic routines you have two options:
The first is to entirely use IDA, and simply search for immediate
values of 'A000h'. Since this is the area of memory that graphics are
commonly displayed in, it can be a quick way to locate graphic routines.
The other alternative is to use the DosBox Debugger. It has a
command called 'bpm' that allows you to set a memory breakpoint, which
then gets triggered if the given memory address changes. So you could do
'bpm A000:0' to set a breakpoint on the first byte of the screen memory
(i.e. the top left hand corner of the screen). Then whichever routine
modifies it first will trigger the breakpoint. Using the previously
discussed techniques, you can find the same place in your IDA
disassembly, and look into reversing that method first.
It is likely that related functions will be next to each
other, so once you've looked into the identified function, you may
also be able to review the preceding or following functions to see if
they are identifiable graphic routines too.
Data Segment strings
The strings in the data segment can be an excellent source for
identifying the purposes of various methods. If you're very lucky, there
may be error messages that contain the name of the function as part of
the error. In which case, you can then find out what code references it,
and name the methods appropriately.
Note: IDA is good, but it's not perfect. It isn't always able to
figure out that a given value loaded into a register somewhere in the
program is a reference into the data segment. As such, if the
cross-reference command doesn't give any
references for a given string, try searching for an immediate value of
the offset of the string. Chances are, any reference you find is likely
pointing to the string. You can then use the 'O' command to change the
operand from an immediate value to instead point to the offset in the
data segment.
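What such an immediate search looks for can be sketched as follows. In 16-bit code an immediate offset is stored as two bytes, low byte first, so a string at offset 1234h will appear in the code as the bytes 34h 12h. The function name and the sample bytes are made up for illustration:

```python
import struct


def find_immediate(code, value):
    """Return offsets where value appears as a little-endian 16-bit word."""
    needle = struct.pack("<H", value)
    hits = []
    pos = code.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = code.find(needle, pos + 1)
    return hits


# Made-up code bytes: mov dx, 1234h / mov ah, 9 / int 21h
code = bytes.fromhex("ba3412b409cd21")
print(find_immediate(code, 0x1234))
```

A raw byte search like this will also turn up coincidental matches, so each hit still needs checking in the disassembly to confirm that it really is an operand and that it's being used as a data segment offset.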
Even if the error messages don't contain method names, they can
still prove invaluable. For example, an error message like "Unable
to initialise mouse" tells you that whatever method uses it is setting
up the mouse for access, so could be given a name like
'initialise_mouse', or 'initialise_events'.
Likewise, have a look at the context of what needs to happen for
the error message to be printed, since the message can give you insights
into what is being done. A message like "No more free inventory slots"
tells you that the code that references this error message is likely a
routine for adding an item to the inventory (hence the error if no more
slots are available). From there you might be able to identify the area
of memory containing the inventory list, and then cross reference other
methods that also access it - you could end up identifying a whole
group of methods related to inventory manipulation.
Another possibility for data segment strings is to pick strings
that are descriptive enough for you to guess their purpose. For example,
if a string has a value like 'FNT20' or 'UI.FNT', it's likely the name
of a font file, containing the images for each character for display on
the screen. In that case, any code that references it is
likely to be passing it to a 'font_open' function, which loads up the
font. If you can disassemble that method, you may be able to determine
how the loaded font is stored in memory. From there, you can use the
cross references function of IDA to find other methods that use the same
memory address as where the font is stored, and which will likely give
you the methods for actually displaying text on the screen.
From there, you may be able to go even further, and start
figuring out methods that call the 'write string' function: menus,
hotspots, conversation handlers, and so forth.
Program execution
Another way of identifying methods is simply to execute the
program itself. When running the program in the DosBox Debugger, you
may find it useful to step through the main procedure to see what
happens as you step over each method call. IDA may be helpful in
identifying the main procedure, but even if not, most programming
languages generate a series of method calls for setting up the initial
application state, and then a single final call to the 'main' method.
With the main method identified, stepping over each method may
produce interesting results. For example, stepping over a single method
call may run all the code for showing the game's introduction sequence
before returning control to the debugger. If this can be identified, you
could name the method appropriately. You may then be able to glean
information both from how the method is called and from the method itself:
For where the method is called, there may be conditional checks
that decide whether the method is called or not. For example, some games
may have a stored settings file to flag whether the introduction has
been shown, and not show it again after the first time the game is
played. In that case, a method call just prior to the call that shows
the introduction may be for reading in the game settings and checking
the flag for whether the introduction has been shown. Knowing this, you
could name the method that reads the settings, then within it the
methods for file opening and reading, and so on.
Within the method itself, you may likewise be able to figure out
further specific details of how the method is implemented from the
method calls it makes. For example, a 'play introduction' method may
consist of calling the same method multiple times with different
parameter values. These values may well be offsets within the data
segment for resource or file names for specific animations to run. In
which case, you now know the sub-method is an animation player, and name
it accordingly. You could then start work on the animation player,
figuring out how it loads its data, and what method it uses to build
up and display graphics on the screen.
Particularly for cases like that, identifying and naming the
graphic/screen methods may be helpful, since you could work the
disassembly from both the front end reading the animation, and from the
low level drawing of the graphics of the animation.
Final Words
Reverse engineering a game can be a rewarding experience, but plan to
spend a lot of time working on it. It can take months, or even years for
really complicated games. As you gradually start figuring out sections
of the original game, it may prove helpful to dive into creating the
engine in a fork of the ScummVM code. Re-implementing things such as
resource managers and file access may help you figure out the purpose of
other methods more easily, since you can see what the code you implement
does, and thus get a better idea of what kind of data will be passed to
other methods you haven't yet figured out.