Pate has posted more WIP news about DSx86, his DOS emulator for the Nintendo DS:
Well, the big architectural change in the next version is that I finally got the self-modifying version of my IRQ handling code to work! This is something that I have wanted to do pretty much since the start of this project. I tried to use self-modifying code (SMC) back in December last year, but I could not make it work reliably at that time. At the beginning of my summer vacation I decided to look into this again, and now it looks like I got it to work reliably! The rewritten IRQ handling I did at the end of May probably helped to make this work, as the IRQ code is now in one place and easier to change.
Ever since the beginning, the main opcode dispatcher loop of DSx86 has looked like this:
loop:
ldr r1,[sp, #SP_IRQFLAG] @ Get the IRQFlag value (set to 0 if we need to handle an IRQ, else 0xFF)
ldrb r0,[r12],#1 @ Load opcode byte to r0, increment r12 by 1
mov r2, r3 @ Clear segment override, r2 = r3 = physical DS:0000
and r1, r0 @ AND the opcode value with the IRQFlag value, result is either just the opcode or 0
bic r9, #0xFF @ Clear segment override flags from low byte of r9
ldr pc,[sp, r1, lsl #2] @ Jump to the opcode handler (or opcode 0 if r1 == 0)
@ ------------------- 00 = ADD r/m8, r8 -------------------------------
op_00:
@-------
@ Check if we are to handle an interrupt instead of op_00
@-------
mrs r1,cpsr @ Save flags to r1
cmp r0, #0 @ Do we really need to handle opcode 00 instead of an IRQ?
bne IRQStart @ Nope, we need to handle an IRQ instead.
@-------
@ Handle opcode 00. No need to restore processor flags, as the following "adds" will change them anyways.
@-------
...
That is, I first read a mask from the stack which states whether the code should jump to the IRQ handler next or keep on handling opcodes. Then I read the opcode byte, make the effective segment register r2 point to the start of the DS segment, mask the opcode byte with the IRQ mask, clear the flags that tell whether we had a segment override prefix, and finally load the program counter from the opcode handler address table (on the stack), which causes a jump to the opcode handler. If SP_IRQFLAG is zero, the jump goes to the opcode 0x00 handler, which first checks whether the opcode actually was zero, and jumps to the IRQStart handler if it wasn't.
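The control flow of this mask-based dispatcher can be sketched as a small Python toy model. This is only an illustration of the logic in the listing above, not DSx86 code: the function name `dispatch` and the constants `IRQ_PENDING` and `IRQ_IDLE` are made up for the example, and handlers are represented by their names instead of real code addresses.

```python
# Toy model of the mask-based dispatcher (illustrative only, not DSx86 code).
IRQ_PENDING = 0x00   # SP_IRQFLAG value when an IRQ needs to be handled
IRQ_IDLE = 0xFF      # SP_IRQFLAG value during normal execution


def dispatch(opcode, irq_flag):
    """Return the name of the routine the dispatcher ends up running."""
    # "and r1, r0": the table index is the opcode, or 0 if an IRQ is pending.
    index = opcode & irq_flag
    if index != 0:
        # "ldr pc,[sp, r1, lsl #2]": jump straight to the opcode handler.
        return "op_%02X" % index
    # Index 0 always lands in op_00, which has to tell a real 0x00 opcode
    # apart from a masked one ("cmp r0, #0" followed by "bne IRQStart").
    if opcode != 0:
        return "IRQStart"
    return "op_00"


if __name__ == "__main__":
    print(dispatch(0x90, IRQ_IDLE))      # normal opcode, no IRQ pending
    print(dispatch(0x90, IRQ_PENDING))   # opcode masked to 0, IRQ is taken
    print(dispatch(0x00, IRQ_PENDING))   # a real 0x00 opcode still runs first
```

One detail the model makes visible: if the next opcode really is 0x00 while an IRQ is pending, the listing executes that opcode and the IRQ is only taken at a later dispatch.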
The annoying thing in this code is that I need to perform a memory read for the IRQ mask and a masking operation for every single opcode, even though IRQs happen extremely rarely (from the CPU speed point of view). That is 4 extra CPU cycles wasted on every single opcode. Really annoying and frustrating when coding an emulator that should run as fast as possible.
However, now that I finally managed to make the SMC version robust, the main opcode loop looks like this:
loop:
ldrb r1,[r12],#1 @ Load opcode byte to r1, increment r12 by 1
mov r2, r3 @ Clear segment override, r2 = r3 = physical DS:0000
bic r9, #0xFF @ Clear segment override flags from low byte of r9
.global SM_IRQFLAG
SM_IRQFLAG: @ SELF_MODIFIED CODE!
ldr pc,[sp, r1, lsl #2] @ Jump to the opcode handler (or load r0 register if IRQ)
b IRQStart @ Jump to IRQ handler if we did not jump above.
I got rid of the IRQFlag in the stack and the mask operation, so the main dispatcher loop does not have any code relating to IRQ handling. Instead, I replace the opcode at the SM_IRQFLAG address with a different one when the code needs to jump into IRQStart. The two opcodes that can be at SM_IRQFLAG are as follows:
#define IRQ_ON 0xE79D0101 @ ldr r0,[sp, r1, lsl #2]
#define IRQ_OFF 0xE79DF101 @ ldr pc,[sp, r1, lsl #2]
That is, the opcode loads the r0 register (which is used as a scratch register in DSx86) instead of the program counter when IRQ handling should start, so the program flow continues to the "b IRQStart" branch instruction. The IRQStart routine then restores the IRQ_OFF opcode to SM_IRQFLAG and performs the other work needed when beginning IRQ handling. This change removed the 4 extra CPU cycles from the handling of every opcode, so DSx86 became about 8% faster than before just by this small change!
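The two instruction words differ only in the destination register field (bits 12 to 15 of an ARM load instruction), which is why a single word store is enough to switch between them. The Python sketch below rebuilds both words from their fields and models the patched slot. It is a toy model under stated assumptions, not DSx86 code: `encode_ldr`, `dest_register` and the `Dispatcher` class are hypothetical names, and handlers are again represented by name only.

```python
# Toy model of the self-modified dispatch slot (illustrative, not DSx86 code).
IRQ_ON = 0xE79D0101    # ldr r0,[sp, r1, lsl #2]
IRQ_OFF = 0xE79DF101   # ldr pc,[sp, r1, lsl #2]
PC = 15                # r15 is the program counter on ARM
SP = 13                # r13 is the stack pointer


def encode_ldr(rd, rn=SP, rm=1, shift=2):
    """Encode 'ldr rd,[rn, rm, lsl #shift]' (ARM mode, condition 'always')."""
    return 0xE7900000 | (rn << 16) | (rd << 12) | (shift << 7) | rm


def dest_register(word):
    """Extract the destination register field (bits 12..15)."""
    return (word >> 12) & 0xF


class Dispatcher:
    """Models the instruction word stored at SM_IRQFLAG."""

    def __init__(self):
        self.slot = IRQ_OFF

    def raise_irq(self):
        # The IRQ side patches the slot so the load no longer targets pc.
        self.slot = IRQ_ON

    def step(self, opcode):
        if dest_register(self.slot) == PC:
            return "op_%02X" % opcode   # load into pc: jump to the handler
        # Load went to r0, so execution falls through to "b IRQStart",
        # which restores the normal instruction before doing anything else.
        self.slot = IRQ_OFF
        return "IRQStart"


if __name__ == "__main__":
    print(hex(encode_ldr(0)), hex(encode_ldr(PC)))
    d = Dispatcher()
    print(d.step(0x90))
    d.raise_irq()
    print(d.step(0x90))
    print(d.step(0x90))
```

Unlike the masked version, this dispatcher needs no special case in the opcode 0x00 handler, since the decision is carried by the instruction itself.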
As you might remember, at the end of May Norton Sysinfo displayed the DSx86 speed as 10.6 times the original PC. After this change the speed is up to 11.5 times the original PC, which is even faster than the original Nov 12th, 2009 Sysinfo measurement of 11.3 times the original PC on real hardware (11.6 on No$GBA). At that time the code was still missing most of the current features, so it is no surprise it ran much faster then than it has been running recently, before this SMC change.
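As a quick check of the figures, the relative gain implied by the two Sysinfo readings can be computed directly. The helper name `percent_gain` is made up for this example; the inputs are the 10.6 and 11.5 values quoted above.

```python
def percent_gain(before, after):
    """Relative speedup of 'after' over 'before', in percent."""
    return (after / before - 1.0) * 100.0


if __name__ == "__main__":
    # Sysinfo readings quoted in the text: 10.6 before, 11.5 after.
    print(round(percent_gain(10.6, 11.5), 1))
```

The result rounds to 8.5, in line with the roughly 8% improvement mentioned in the post.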
Game-specific fixes
I have been following the compatibility wiki closely, it is very interesting and
...