DSx86 News - DOS Emulator for Nintendo DS - LOADFIX, EGA and opcode work [Archive] - DCEmu Network: The Homebrew, Hacking & Gaming Network

wraggster

January 17th, 2010, 21:51

Pate has posted some WIP news concerning his DOS emulator for the DS:

Today I got fed up with the "Packed File Corrupt" problem, and decided to see if I could handle the required "LOADFIX" functionality automatically within DSx86. I looked at a couple of games that cause this behaviour, and used a hex editor and a debugger to see if I could find a pattern in their headers that would reveal the use of the buggy /EXEPACK linker switch that causes this problem. I found the following things in common in the EXE headers of the problem games:

RelocationItems is zero.
HeaderSize is 0x20 paragraphs (512 bytes), even though the actual header is only 28 bytes.
Initial SP is 0x0080.
Initial IP is either 0x0010 or 0x0012.
Practically all EXE packers have zero RelocationItems, so that alone is not a sufficient indicator of the buggy EXEPACK code, but I didn't find all of the above header settings together in any packed EXE file that doesn't have the "Packed File Corrupt" problem.
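The header check described above could be sketched roughly as follows. This is a Python illustration rather than the actual DSx86 code; the function and field names are made up for the example, and only the four tested values come from the post. The field order is the standard 28-byte MZ header layout of fourteen little-endian 16-bit words.

```python
import struct

# Field order of the 28-byte MZ header (fourteen little-endian 16-bit words).
# The names are illustrative, not DSx86 identifiers.
MZ_FIELDS = ("magic", "last_page_bytes", "pages", "relocation_items",
             "header_paragraphs", "min_alloc", "max_alloc", "initial_ss",
             "initial_sp", "checksum", "initial_ip", "initial_cs",
             "reloc_table_offset", "overlay_number")

def parse_mz_header(data: bytes) -> dict:
    """Decode the fixed part of an MZ EXE header into a name -> value dict."""
    if len(data) < 28:
        raise ValueError("need at least 28 header bytes")
    return dict(zip(MZ_FIELDS, struct.unpack_from("<14H", data)))

def needs_loadfix(data: bytes) -> bool:
    """Heuristic from the post: all four header values must match."""
    h = parse_mz_header(data)
    return (h["magic"] == 0x5A4D                 # "MZ" read as a little-endian word
            and h["relocation_items"] == 0
            and h["header_paragraphs"] == 0x20   # 0x20 * 16 = 512 bytes
            and h["initial_sp"] == 0x0080
            and h["initial_ip"] in (0x0010, 0x0012))

# A made-up header that carries the signature (the other fields are arbitrary).
sample = struct.pack("<14H", 0x5A4D, 0, 40, 0, 0x20, 0, 0xFFFF,
                     0x1000, 0x0080, 0, 0x0012, 0x1C59, 0x1C, 0)
print(needs_loadfix(sample))
```

Requiring all four values at once keeps the test cheap, since it only needs the first 28 bytes of the file, at the cost of possibly missing variants of the packer that use other values.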

I made a small code change to DSx86 so that it detects the above EXE header signature, and allocates 64KB of RAM before allocating the memory for the program and running it. I also added new code to the FreeProcessMemory() routine so that when the process exits it checks whether an extra block of memory was allocated, and frees that as well when freeing the actual memory of the process. The end result was that the programs that have that EXE header signature only get 580KB of RAM, and don't give the "Packed File Corrupt" message any more.

I might need to adjust this detection algorithm in the future to also check the start of the actual code, which seems to always be nearly identical to this:

1C59:0012 8CC0      MOV   AX,ES
1C59:0014 051000    ADD   AX,0010
1C59:0017 0E        PUSH  CS
1C59:0018 1F        POP   DS
1C59:0019 A30400    MOV   [0004],AX
1C59:001C 03060C00  ADD   AX,[000C]
1C59:0020 8EC0      MOV   ES,AX
1C59:0022 8B0E0600  MOV   CX,[0006]
1C59:0026 8BF9      MOV   DI,CX
1C59:0028 4F        DEC   DI
1C59:0029 8BF7      MOV   SI,DI
1C59:002B FD        STD
1C59:002C F3        REPZ
1C59:002D A4        MOVSB
1C59:002E 8B160E00  MOV   DX,[000E]
1C59:0032 50        PUSH  AX
1C59:0033 B83800    MOV   AX,0038
1C59:0036 50        PUSH  AX
1C59:0037 CB        RETF

Detecting this code before actually loading the EXE into memory is slightly more difficult than just looking at the EXE header, so I hope the current change will fix at least most of the problems.
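For illustration, such a code check could look something like the Python sketch below. It is not part of DSx86. The stub bytes are copied from the listing above, but the function names, the choice to compare only the first 16 bytes (up to MOV ES,AX), and the assumption that the file image can be read directly at the entry point are all mine. Since the code is only "nearly identical" between games, a real check would probably have to tolerate differing constants.

```python
# Opcode bytes of the unpacker start, taken from the disassembly listing.
EXEPACK_STUB = bytes.fromhex(
    "8CC0" "051000" "0E" "1F" "A30400" "03060C00" "8EC0"
    "8B0E0600" "8BF9" "4F" "8BF7" "FD" "F3A4" "8B160E00"
    "50" "B83800" "50" "CB")

def entry_file_offset(header_paragraphs, initial_cs, initial_ip):
    """File offset of the entry point: the load image starts right after the
    header, and CS:IP is relative to the start of the load image."""
    return header_paragraphs * 16 + initial_cs * 16 + initial_ip

def starts_with_exepack_stub(exe, header_paragraphs, initial_cs, initial_ip,
                             prefix_len=16):
    """Compare the first prefix_len bytes at the entry point with the stub.
    prefix_len=16 is an arbitrary choice for this sketch."""
    start = entry_file_offset(header_paragraphs, initial_cs, initial_ip)
    return exe[start:start + prefix_len] == EXEPACK_STUB[:prefix_len]

# A fake file: 512 header bytes, 0x12 filler bytes, then the stub.
fake_exe = bytes(512) + bytes(0x12) + EXEPACK_STUB
print(starts_with_exepack_stub(fake_exe, 0x20, 0x0000, 0x0012))
```

The extra difficulty mentioned above shows up in entry_file_offset(): the check has to seek to CS:IP inside the file instead of reading only the first 28 bytes.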

Galactic Battle
I had a plan to work this weekend on adding support for EGA, specifically mode 0x0D (320x200 with 16 colors). Last week I searched for a small game that would use that mode, and I downloaded a couple of games that weren't suitable (had a lot of extra files) until I found Galactic Battle which seemed to be just what I was looking for. It is a small Space Invaders clone that uses mode 0x0D and PC Speaker sounds.

It did have the "Packed File Corrupt" problem, and I didn't have the above fix in the code at that time, but that was easy to work around by starting 4DOS without swapping. Anyway, this Friday I began working on emulating the 16-color mode. During the last week I had been thinking about ways to emulate this mode, and I did come up with a solution that I thought might work, so I began coding it. I managed to get the code working pretty well already on Saturday, and I took a screen copy of Galactic Battle running in DSx86. It was quite playable, perhaps just a little bit slow.

EGA emulation
The 16-color modes are a lot more complex than any graphics mode I have so far supported. MCGA 320x200 with 256 colors is the easiest, as each pixel is simply a byte that is an index to a palette, exactly like the bitmapped background modes in Nintendo DS. The CGA mode was a little bit more complex, as it has 2 bits per pixel, but that was easily handled via a look-up table (LUT).

However, EGA and VGA 16-color modes are a different beast entirely. They use four separate memory planes, where a byte in each plane covers 8 neighbouring pixels, and each plane contains one bit of the 4-bit color. All four planes share the same memory address (in segment 0xA000), and the plane that is being read or written is determined by writing a certain mask to a certain EGA/VGA I/O register. Thus, writing for example 4 neighbouring pixels of different colors might need 4 writes to the same memory position, and most likely also four reads and some bit masking so that the other 4 pixels in the same byte don't get overwritten. A pretty complex scenario to emulate (and especially to do it fast!).
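To make the planar layout concrete, here is a minimal Python model of it. This is only a sketch of the simplest case: the class and method names are invented, and the real hardware also has latches, a bit mask register, set/reset logic and several write modes that are ignored here.

```python
PLANE_SIZE = 64 * 1024

class PlanarVram:
    """Four 64KB planes that share one address range (simplified model)."""

    def __init__(self):
        self.planes = [bytearray(PLANE_SIZE) for _ in range(4)]
        self.map_mask = 0x0F          # bit n set = plane n accepts writes

    def write(self, offset, value):
        # The same byte goes to every plane enabled in the mask;
        # disabled planes keep their old contents.
        for plane in range(4):
            if self.map_mask & (1 << plane):
                self.planes[plane][offset] = value

    def pixel_color(self, offset, pixel):
        # Pixel 0 is the leftmost pixel, stored in the highest bit of the byte.
        bit = 7 - pixel
        return sum(((self.planes[p][offset] >> bit) & 1) << p
                   for p in range(4))

vram = PlanarVram()
vram.write(0, 0x18)                   # two middle pixels, all planes
vram.map_mask = 0x04                  # red plane only
vram.write(1, 0xC0)                   # two leftmost pixels
print([vram.pixel_color(0, p) for p in range(8)])
print([vram.pixel_color(1, p) for p in range(8)])
```

Reading back a single pixel already touches four different bytes in this model, which is the cost that matters when the whole screen has to be converted for display.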

The real EGA has 4 times 64KB planes totalling 256KB of RAM, and I didn't want to spend more than this 256KB of RAM in my emulator. I also didn't want to assign less than this amount of RAM to the EGA/VGA memory, as many games use page flipping and assume that this much memory is available. So, I needed some way to make the memory behave like four different planes of 64KB each, but I also needed a way to blit this fast into the Nintendo DS VRAM, which is organized as 8 bits per pixel (that is, each byte is a separate pixel).

The straightforward method to emulate this might have been to allocate 64KB of RAM for each of the four planes, and use these planes like the original EGA/VGA uses them: each byte contains 8 pixels, and the combination of the planes having a pixel set determines the output color. However, I thought that building each output pixel (while blitting the screen) from a single bit in four different memory locations would pretty much kill the performance. There would be no practical way to use the ldmia opcode to load several words from the source buffer, and splicing each input bit to a separate output byte sounded like a really slow operation as well.

The idea I had during the week was that perhaps I could swap the way the memory is organized in the emulated EGA/VGA memory. I wanted to have all data that is needed to build an output pixel as close together in the source RAM as possible, and I also wanted the source data (when blitting) to have at least some resemblance to the output byte-per-pixel organization. So, I thought that keeping the 4 bits that are used to determine the color together might make everything faster. In the real EGA/VGA it takes 32 bits to contain 8 pixels, and I could also fit 8 pixels into 32 bits (a word) even if I used 4 bits for each pixel. So, in my current implementation each byte is actually a word, and each bit is actually a 4-bit color value.

I use a LUT to convert from an input byte (for example during a write to EGA/VGA RAM using a stosb opcode) to a word, which is then masked with a write mask based on a value written to the EGA/VGA register that controls which planes are accessed when a byte is written to EGA/VGA VRAM.

To make this emulated RAM fast to blit to the screen, I also interleaved the pixel positions so that the 4-bit pixels in a word are organized as 73625140. That allowed me to easily reorganize the word into two separate words containing 4-bit pixels 03020100 and 70605040 (or 8-bit pixels numbered 3210 and 7654), which can then be written to Nintendo DS VRAM. I copied the EGA palette to all sixteen 16-color blocks of the palette, so I don't even need to clear the extra bits from the bytes. I can write the data as-is, like ?3?2?1?0 and ?7?6?5?4 (after a shift right by 4 bits).
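The byte-to-word conversion and the masked write could be sketched like this in Python. The 73625140 nibble order and the use of a LUT plus a plane mask come from the text; the function names, the exact LUT contents (all four bits set for each pixel that is set in the input byte) and the way the mask is expanded are my assumptions, not necessarily how DSx86 does it.

```python
def nibble_index(pixel):
    """Nibble (0 = lowest) that holds a pixel in the 73625140 layout."""
    return (pixel & 3) * 2 + (pixel >> 2)

def expand_byte(value):
    """Spread the 8 bits of a byte into 8 nibbles, all-ones where the bit is set."""
    word = 0
    for pixel in range(8):
        if value & (0x80 >> pixel):           # pixel 0 = highest bit
            word |= 0xF << (4 * nibble_index(pixel))
    return word

# 256-entry look-up table, built once.
BYTE_TO_WORD = [expand_byte(v) for v in range(256)]

def write_byte(ram, offset, value, map_mask):
    """Emulate a byte write: each byte offset is one 32-bit word in ram."""
    # Repeat the 4-bit plane mask in all 8 nibbles, e.g. 0x4 -> 0x44444444.
    plane_bits = (map_mask & 0x0F) * 0x11111111
    keep = ram[offset] & ~plane_bits & 0xFFFFFFFF
    ram[offset] = keep | (BYTE_TO_WORD[value] & plane_bits)

ram = [0] * 16                                # tiny stand-in for the 256KB buffer
write_byte(ram, 0, 0x18, 0x0F)                # two middle pixels, bright white
write_byte(ram, 1, 0xC0, 0x04)                # two leftmost pixels, red
print(hex(ram[0]), hex(ram[1]))
```

With this layout a byte write costs one table look-up and a few logical operations, no matter how many planes are enabled.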

This is a snippet of the blitting code. I read 4 words and write them to 8 words, as each 4-bits-per-pixel input value is converted to an 8-bits-per-pixel output value:

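ldmia r1!, {r3,r5,r7,r9}  @ Load 4 words = 4*8 = 32 pixels
mov r10, r9, lsr #4       @ r10 = pixels 7654 of the 4th word
mov r8, r7, lsr #4        @ r8 = pixels 7654 of the 3rd word
mov r6, r5, lsr #4        @ r6 = pixels 7654 of the 2nd word
mov r4, r3, lsr #4        @ r4 = pixels 7654 of the 1st word
stmia r0!, {r3-r10}       @ Store 8 words = 32 bytes = 32 pixels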
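The same split can be modelled in Python to see why no masking is needed. This is only a model of the data flow, not a translation of the actual blitter: the function names are invented, and the palette here is a dummy list used to show the effect of repeating the 16 colors in every 16-entry block.

```python
import struct

def blit_words(words):
    """Turn each 8-pixel source word into two output words (8 bytes),
    mirroring the ldmia / lsr #4 / stmia sequence above."""
    out = []
    for w in words:
        out.append(w & 0xFFFFFFFF)            # bytes ?0 ?1 ?2 ?3 in memory order
        out.append((w >> 4) & 0xFFFFFFFF)     # bytes ?4 ?5 ?6 ?7 in memory order
    return struct.pack("<%dI" % len(out), *out)

def make_palette(ega16):
    """Repeat the 16 colors in all sixteen 16-entry blocks of a 256-entry palette."""
    return [ega16[i & 0x0F] for i in range(256)]

palette = make_palette(list(range(16)))       # dummy colors 0..15
raw = blit_words([0x0F0000F0, 0x00000404])    # the two example words
print([palette[b] for b in raw])
```

The high nibble of every output byte is leftover data from a neighbouring pixel, but because each 16-entry block of the palette holds the same colors, the displayed color depends only on the low nibble.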

Here is an illustration of the changed memory layout. Hopefully you can make sense of it; it is pretty difficult to explain clearly.

Memory organization
-------------------

EGA/VGA:
- Pixels in bits of a byte: 01234567 (where leftmost is the highest bit)
- Colors:
  - Plane 0: BBBBBBBB (each bit set means the corresponding pixel has a blue component)
  - Plane 1: GGGGGGGG (each bit set means the corresponding pixel has a green component)
  - Plane 2: RRRRRRRR (each bit set means the corresponding pixel has a red component)
  - Plane 3: IIIIIIII (each bit set means the corresponding pixel has an intensity component)

DSx86:
- Pixels in bits of a word: 77773333666622225555111144440000
- Colors in bits of a word: IRGBIRGBIRGBIRGBIRGBIRGBIRGBIRGB

Example 1, setting two middle pixels to bright white:
-----------------------------------------------------

Input byte: 0x18 = 0b00011000 (highest bit is the leftmost pixel on screen)
Write Mask: 0x0F = 0b1111 (all color components active)

Original EGA memory result:
Plane 0 (blue): 0b00011000
Plane 1 (green): 0b00011000
Plane 2 (red): 0b00011000
Plane 3 (intensity): 0b00011000

DSx86 memory result:
Emulated RAM: 0b00001111000000000000000011110000

Example 2, setting the two leftmost pixels to red:
--------------------------------------------------

Input byte: 0xC0 = 0b11000000 (highest bit is the leftmost pixel on screen)
Write Mask: 0x04 = 0b0100 (red color component active)

Original EGA memory result:
Plane 0 (blue): 0b00000000
Plane 1 (green): 0b00000000
Plane 2 (red): 0b11000000
Plane 3 (intensity): 0b00000000

DSx86 memory result:
Emulated RAM: 0b00000000000000000000010000000100

Opcode work
During the week I added many of the missing modrm bytes. I started from the beginning (opcode 0x00 = ADD r/m8,r8) and systematically added every single modrm variation. I am currently at opcode 0x38; all the smaller-numbered opcodes have all their modrm variations handled. This was mostly copy/paste work, as for example all AND, OR and XOR opcodes behave exactly the same, and only the actual operation in my opcode handlers differs.
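For readers unfamiliar with the modrm byte, the Python sketch below decodes one for an 8-bit opcode such as ADD r/m8,r8 in 16-bit addressing mode. It only illustrates how many variations a single opcode has; the function name is made up, and it says nothing about how the DSx86 handlers are organized internally.

```python
REG8 = ("AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH")
RM16 = ("BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX")

def decode_modrm8(modrm):
    """Return (r/m operand, register operand) for an 8-bit opcode."""
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:
        operand = REG8[rm]                    # register-to-register form
    elif mod == 0 and rm == 6:
        operand = "[disp16]"                  # direct address, no base register
    else:
        disp = ("", "+disp8", "+disp16")[mod]
        operand = "[" + RM16[rm] + disp + "]"
    return operand, REG8[reg]

# Opcode 0x00 followed by modrm byte 0x47 is ADD [BX+disp8],AL.
print(decode_modrm8(0x47))
```

Every one of the 256 modrm values decodes to a different operand pair, so an emulator that wants a dedicated code path per combination ends up with a lot of nearly identical handlers.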

This copy/pasting meant that the size of my CPU emulation source code increased quite a bit. Currently it has 46,507 rows and is about 1.38 megabytes in size. In version 0.02 it had 35,287 rows and was 1.08 MB, so it has grown by over 10,000 rows since then, and I still have a lot of modrm variations to add. I'm starting to worry about possible macro or label limits in the GNU Assembler, but I can of course split the file into several smaller files if needed. I'd like to keep the file as a single entity, though, as it currently has only a few well-defined dependencies on other files.

http://dsx86.patrickaalto.com/DSblog.html