DS2x86 Protected Mode work [Archive] - DCEmu Network: The Homebrew, Hacking & Gaming Network

wraggster

October 17th, 2010, 21:30

More DOS emulator for DS news from Pate:

This past week has seen slow but steady progress with the protected mode support in DS2x86. After the previous blog post, I started thinking about ways to make the existing real-mode code more compatible with the needs of protected mode and 32-bit memory access. In the original DSx86 code I kept all the 16-bit register values shifted to the high 16 bits. The currently effective segment was kept in the high 16 bits of the r2 register, with the lowest byte of r2 telling whether a segment override is in effect. Most of the opcodes need to know this to calculate the correct memory address when using BP-based indexing: while memory access normally defaults to the Data Segment (DS), addressing memory with the BP register defaults to the Stack Segment (SS). The SS register value was kept in the high 16 bits of the r3 register, with the DS register value in the low 16 bits of the same register. Thus, in the main loop I could easily turn the segment override off and load the default DS value into r2 like this:

ldrb r1,[r12],#1 @ Load opcode byte to r1, increment r12 by 1
mov r2, r3, lsl #16 @ r2 high halfword = logical DS segment, clear segment override flags
ldr pc,[sp, r1, lsl #2] @ Jump to the opcode handler

It is hard to get much more efficient than that: a single instruction performs two logically different operations, making r2 contain the currently effective segment and clearing the segment override flag. The BP-based memory address handling in turn checked whether a segment override was in effect and, if not, made the r2 register contain the current SS value with the following code:
.macro mem_handler_bp_destroy_SZflags
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
biceq r2, r3, #0x0000FF00 @ r2 = logical SS segment in high halfword, with garbage in low byte
.endm
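
As a rough illustration of the register packing described above (a Python model, not code from DSx86), the two ARM snippets can be mimicked with plain integer arithmetic. The function names are hypothetical, and `apply_override` in particular is an assumption: the post does not show how the override handlers mark the low byte of r2, only that a non-zero low byte means an override is active.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit ARM registers with Python integers

def pack_r3(ss, ds):
    """r3 layout from the post: SS in the high halfword, DS in the low halfword."""
    return ((ss & 0xFFFF) << 16) | (ds & 0xFFFF)

def main_loop_reset(r3):
    """Models `mov r2, r3, lsl #16`: DS lands in the high halfword, low halfword is 0."""
    return (r3 << 16) & MASK32

def apply_override(segment):
    """Hypothetical override handler: segment in the high halfword, non-zero low byte."""
    return ((segment & 0xFFFF) << 16) | 0x01

def bp_handler(r2, r3):
    """Models `tst r2, #0xFF` followed by `biceq r2, r3, #0x0000FF00`."""
    if (r2 & 0xFF) == 0:                 # low byte zero: no segment override
        r2 = r3 & ~0x0000FF00 & MASK32   # SS in high halfword, DS low byte left as garbage
    return r2

def effective_segment(r2):
    """The segment value the opcode handler would use (high halfword of r2)."""
    return r2 >> 16
```

With SS = 0x2000 and DS = 0x1234, the reset leaves DS as the effective segment, the BP handler switches it to SS when no override is active, and an override value survives the BP handler untouched.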

I had used a somewhat similar approach in DS2x86, as it was just copied and translated to MIPS assembly from the DSx86 method. To prepare for protected mode, I wanted to change this method so that the register that keeps the currently effective segment (#defined to be "eff_seg", in reality register ra) directly contains a linear memory address (which in real mode would be the segment value shifted left 4 bits). So, I could not use the same trick of storing the segment override flag in this register. I really did not want to make the code slower than it currently was, so I actually spent two days just thinking how I could change the segment override flag handling so that the main loop would not slow down (my first priority), I would not need to waste a new register for just this flag (second priority), and that the BP register memory access would also be as fast as possible (third priority).
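
For reference, the real-mode rule mentioned above (segment value shifted left 4 bits) can be written out as a small Python sketch. This only shows the architectural x86 address calculation; how DS2x86 maps such a linear address onto host memory is not described in the post, and the function names are hypothetical.

```python
def segment_base(segment):
    """Real-mode segment base: the 16-bit segment value shifted left by 4 bits."""
    return (segment & 0xFFFF) << 4

def linear_address(seg_base, offset):
    """Linear address = segment base + 16-bit offset (no A20 wrap modelled here)."""
    return seg_base + (offset & 0xFFFF)
```

Keeping the already shifted base in the effective segment register means the shift is done once when the segment register is loaded, not on every memory access, and the same register can later hold a protected-mode descriptor base instead.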

After spending two days thinking about this problem, the solution finally occurred to me. In the end the main loop did not get any slower, I did not need to use a new register, and the BP addressing was just as fast as before! Here is the resulting code, with some explanation following.

lw t1, opcode_table(t1) // Get the opcode handler address from the opcode table
move eff_seg, eff_ds // Set DS to be the effective segment
ori flags, FLAG_SEG_OVERRIDE // Set flags bit 1, telling we have no segment prefix
jr t1 // Jump to the opcode handler

After assembling, the generated code looks like this:
8006453c: 8d290000 lw t1,0(t1)
80064540: 01e0f821 move ra,t7
80064544: 01200008 jr t1
80064548: 37390002 ori t9,t9,0x2

The MIPS assembler makes a lot of changes to the original ASM code behind the scenes, due to the peculiar features of the processor. For example, all jumps and branches have a "branch delay slot" following them, and the instruction in that slot is executed before the branch actually takes effect. The assembler reorders the opcodes so that the jump is moved before the preceding opcode, if the preceding opcode (ori in my example) has no effect on the branch instruction itself (which it does not here). If the jump cannot be moved higher, a NOP is added into the branch delay slot, wasting one CPU cycle. Also, as loads from memory (the lw opcode) cause a pipeline stall if the loaded register is used in the next opcode, you lose a CPU cycle unless you have some useful operation (one that does not use the loaded register) to put immediately after the load opcode. So the main loop needs one operation after the load of the handler address and one operation in the branch delay slot in any case, and the move and ori opcodes fill exactly those two slots. Thus, there is no way to make the main loop code faster than what it currently is, and my first priority was fulfilled.
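
The execution order implied by the branch delay slot can be sketched with a toy interpreter in Python. This is a deliberately simplified model (no registers, no pipeline timing), and the program list and the `run` function are hypothetical names used only for illustration. It shows just that the instruction placed after `jr` still runs before control reaches the jump target.

```python
def run(program, start=0):
    """Execute (op, arg) pairs with a MIPS-style branch delay slot.

    A "jr" takes effect only after the instruction that follows it has run.
    Returns the list of executed op names in order.
    """
    trace = []
    pc = start
    pending_target = None          # jump target waiting for the delay slot to finish
    while pc < len(program):
        op, arg = program[pc]
        if op == "halt":
            break
        trace.append(op)
        next_pc = pc + 1
        if pending_target is not None:
            next_pc = pending_target   # delay slot done, now take the jump
            pending_target = None
        if op == "jr":
            pending_target = arg       # jump is deferred by one instruction
        pc = next_pc
    return trace

# Same order as the disassembly above: lw, move, jr, then ori in the delay slot.
PROGRAM = [
    ("lw", None),       # 0: load handler address
    ("move", None),     # 1: fills the slot after the load
    ("jr", 5),          # 2: jump to the handler at index 5
    ("ori", None),      # 3: branch delay slot, still executed
    ("skipped", None),  # 4: never reached
    ("handler", None),  # 5: opcode handler
    ("halt", None),     # 6: stop the toy machine
]
```

Running `run(PROGRAM)` visits `ori` before `handler` and never visits `skipped`, which is why a useful instruction in the delay slot costs nothing extra.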

I managed to fulfill my second priority by using the x86 CPU flags emulation register (#defined as "flags", being in reality register t9). The x86 flags register has a reserved bit 1 (with value 2) that should always be set. I set this bit in the main loop, and then reset the bit to zero in all segment override handlers. Since the code that would need to use the full flags register value (practically only the PUSHF opcode handler) will never have a segment override, this will cause no problems in any code that handles the flags register.

The macro that selects the segment for BP-register based addressing looks like the following. The .set noat and .set at directives allow me to use the Assembler Temporary (AT) register myself, while normally the assembler reserves it for all sorts of behind-the-scenes tricks and macro expansions.

.macro mem_handler_bp
.set noat
andi AT, flags, FLAG_SEG_OVERRIDE // at == 0 if we have a segment override
movn eff_seg, eff_ss, AT // If no segment override, put SS into effective segment
.set at
.endm

This is just as efficient as the original DSx86 code, just two assembler opcodes. The andi opcode puts just the flags bit 1 into the AT register (so AT is zero if the flag is not on, meaning a segment override is in effect), and the movn opcode moves eff_ss register into eff_seg register if the AT register is not zero (no segment override in effect). This fulfilled my third priority.
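
The flag trick can be modelled in a few lines of Python to check the logic. This is a sketch and not the DS2x86 source: the value 0x2 for `FLAG_SEG_OVERRIDE` is taken from the disassembly above (`ori t9,t9,0x2`), while `segment_override` is a hypothetical stand-in for the prefix handlers, which the post describes only as clearing the bit.

```python
FLAG_SEG_OVERRIDE = 0x0002  # reserved bit 1 of the x86 flags; set means "no override"

def main_loop(flags, eff_ds):
    """Models the dispatch code: eff_seg = eff_ds and bit 1 is set."""
    return flags | FLAG_SEG_OVERRIDE, eff_ds

def segment_override(flags, seg_base):
    """Hypothetical prefix handler: clears bit 1 and selects the prefixed segment."""
    return flags & ~FLAG_SEG_OVERRIDE, seg_base

def mem_handler_bp(flags, eff_seg, eff_ss):
    """Models `andi AT, flags, FLAG_SEG_OVERRIDE` + `movn eff_seg, eff_ss, AT`."""
    at = flags & FLAG_SEG_OVERRIDE          # zero only when an override is active
    return eff_ss if at != 0 else eff_seg   # movn: move when AT is not zero
```

Without a prefix, `mem_handler_bp` returns the SS base. After `segment_override` has run, it returns the overriding segment base and leaves the rest of the flags untouched.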

In addition to this change I changed all my memory address routines to not use shifted memory offsets, which was a lot of work. There were 266 locations in the code where the shift was used, but only about 220 of these were related to this address calculation and needed changing. I first used a simple find-replace operation in the editor to comment all of these out, and then used my tester program to see which opcodes got broken, and then fixed these one by one. In the end the whole code got about 3% faster! Not a big change, but it was very nice that adding a new feature made the code faster, and not slower as normally happens!

After such extensive code refactoring I finally got back to debugging the PMODE header of Trekmo in DS2x86. The PMODE header first goes to 16-bit protected mode (it jumps to a USE16 segment using the jmp 0020:138E opcode as you saw in the debug output of the previous blog post). Then it sets up the Interrupt Descriptor Table (IDT) while in the USE16 segment, and then goes to 32-bit protected mode (USE32 segment) using a RETF opcode. It took me the rest of last week to add support for the operations PMODE does in the USE16 segment, so that finally today I got DS2x86 to run the RETF opcode properly and switch to the USE32 segment. This is where I am currently at. There is only a small amount of code remaining in the PMODE header until it jumps to my own Trekmo code (jmp 00014ED4 in the debug output, which is the jmp _main command in the following code snippet from the PMODE sources).

I also hacked my debugger memory dump routines so that by dumping address FFFF:30 I can get a formatted output of the Global Descriptor Table (GDT). The GDT that PMODE uses is shown below. You can see that for example selector 20 is a USE16 code segment, while selector 08 is the actual USE32 segment (where the RETF opcode returned to, and where I am currently at). In this case PMODE uses a GDT with a limit of 0x8F (so that all the items happen to fit nicely into the DS2x86 debug screen) and located at linear address 0x000042C4.

p_start: ; common 32bit start
mov eax,gs:[1bh*4] ; neutralize ctrl+break
mov oint1bvect,eax
db 65h,67h,0c7h,6 ; MOV DWORD PTR GS:[1bh*4],code16:nullint
dw 1bh*4,nullint,code16 ;
mov eax,gs:[32h*4] ; set up for new real mode INT32
mov oint32vect,eax
db 65h,67h,0c7h,6 ; MOV DWORD PTR GS:[32h*4],code16:int32
dw 32h*4,int32,code16 ;
in al,21h ; save old PIC masks
mov ah,al
in al,0a1h
mov oirqmask,ax
jmp _main ; go to main code
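
To make the selector and GDT numbers mentioned earlier more concrete, here is a small Python sketch based on the standard x86 selector layout (index in bits 3 to 15, table indicator in bit 2, requested privilege level in bits 0 to 1) and 8-byte descriptors. The function names are hypothetical, and the sketch does not decode the descriptor contents themselves, since the post does not list them.

```python
def split_selector(selector):
    """Split a protected-mode selector into its index, table and privilege fields."""
    return {
        "index": selector >> 3,                        # descriptor number in the table
        "table": "LDT" if selector & 0x4 else "GDT",   # table indicator bit
        "rpl": selector & 0x3,                         # requested privilege level
    }

def gdt_entry_count(limit):
    """Number of 8-byte descriptors covered by a GDT limit (limit = size - 1)."""
    return (limit + 1) // 8

def descriptor_address(gdt_base, selector):
    """Linear address of the descriptor a GDT selector refers to."""
    return gdt_base + (selector & ~0x7)
```

Selector 20 (hex) is therefore GDT entry 4 and selector 08 is entry 1, and a limit of 0x8F corresponds to 18 descriptors. With the GDT at linear address 0x000042C4, those two descriptors sit at 0x42E4 and 0x42CC.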

The next big thing to do is to add proper protected mode interrupt handling using the IDT, and I also need to improve my stack handling so that switching between 16-bit SP and 32-bit ESP stack pointer addressing works properly. Currently it is somewhat hardcoded to work only in the current situation in PMODE/Trekmo. Besides those features, I still have a lot of new opcodes to add, so these will again keep me busy for quite a while.

http://dsx86.patrickaalto.com/DSblog.html