DSx86 News - Dos Emulator for DS - Refactoring and profiling [Archive] - DCEmu Network: The Homebrew, Hacking & Gaming Network

wraggster

July 25th, 2010, 21:16

Pate has posted more news of his work on a Dos emulator for the DS

Internal refactoring done
This Wednesday I finally got all the internal refactoring done, and was able to do the remaining big architectural changes (that needed all the code to be refactored before I could do them). It took me a while to get the code working again, but by Wednesday evening the code finally seemed to run properly. I checked the CPU speed with Norton SysInfo to see how much the code has slowed down, and the result was pretty worrying.

I decided to see if I could include my old profiling tool (which I had last used in August last year, when the emulation core still was bundled with Wing Commander II files and could not even run 4DOS yet), and make it run with the current DSx86. I had long since stripped the code out from DSx86, but luckily I found it in an old source code backup directory. I added it back, and decided to start by profiling Norton Sysinfo.

Profiling Norton Sysinfo
Initial profiling results
Here is the first profiling result, while running the SysInfo CPU speed test. This was taken pretty much right after I finally got the new code to run properly, without any optimizing done yet. The first table shows the opcodes taking the least average number of ticks (ordered by that value), and the second table shows the opcodes taking the most total number of ticks (again ordered by that value):

opcode byte count min ticks avg ticks total ticks % of total command in ITCM?
NOP 90 1742 8 8.00 13936 0.0082% No operation Yes
CWD 99 78 10 10.00 780 0.0005% Convert word to doubleword Yes
JA 77 4394 10 10.01 43992 0.0260% Jump if unsigned above Yes
MOV DL,imm8 B2 52 11 11.00 572 0.0003% Move imm8 byte to DL register Yes
JL 7C 518056 11 11.01 5704348 3.3762% Jump if signed less Yes

opcode byte count min ticks avg ticks total ticks % of total command in ITCM?
JL 7C 518056 11 11.01 5704348 3.3762% Jump if signed less Yes
ADD r16, r/m16 03 519175 14 28.11 14591630 8.6363% Add 16-bit register or memory to 16-bit register No
ADD/SUB/AND/OR/CMP r/m16,imm16 81 518383 32 32.14 16660587 9.8608% Various 16-bit arithmetic/logical operations with imm16 value No
TEST/NOT/NEG/MUL/DIV r/m16 F7 266630 16 67.06 17880567 10.5829% Various 16-bit memory operations No
MOV r/m16,r16 89 525262 33 37.18 19526937 11.5573% Store a 16-bit register into register or memory No
MOV r16,r/m16 8B 806241 14 24.63 19859369 11.7541% Load a 16-bit register from register or memory Yes
POP r/m16 8F 518064 39 39.16 20285222 12.0061% Pop a value from stack to register or memory No
INC/DEC/CALL/JMP/PUSH r/m16 FF 1048576 34 39.20 41102557 24.3272% Various 16-bit memory operations No

Not surprisingly, NOP (no operation) is the fastest opcode. The ticks run at 33MHz, so 8 ticks means that handling a NOP opcode takes 16 CPU cycles (as the NDS CPU runs at 66MHz). This includes some profiling overhead, so one or two ticks can in effect be decremented from the ticks of all the opcodes to calculate the actual amount of timer ticks the opcode executing takes. The JL opcode is both one of the fastest opcodes and also one of the most frequently executed opcodes. It is interesting that the two most common opcodes are 0xFF and 0x8F, both of which should be rather uncommon in normal programs, especially in games. As opcodes 0x81, 0xF7 and 0xFF can perform several different operations, depending on the so called "modrm" byte following the main opcode byte, I wanted to see what operations exactly SysInfo does with those opcodes:

opcode 81 modrm count min ticks avg ticks total ticks % of total command ITCM?
CMP [disp16],imm16 3E 577 32 95.10 54873 0.1678% Compare global variable with 16-bit value No
CMP [BP+disp8],imm16 7E 1047999 32 32.10 33640850 99.8322% Compare local variable with 16-bit value No
opcode F7 modrm count min ticks avg ticks total ticks % of total command ITCM?
DIV [BP+disp8] 76 204 78 148.66 30326 0.0431% Divide DX:AX by local variable No
IMUL [BP+disp8] 6E 714 29 77.13 55073 0.0783% DX:AX = AX * local variable No
DIV CX F1 2244 67 83.05 186355 0.2651% Divide DX:AX by CX No
DIV BX F3 1043068 67 67.01 69896106 99.4304% Divide DX:AX by BX No
opcode FF modrm count min ticks avg ticks total ticks % of total command ITCM?
PUSH [disp16] 36 225 35 71.23 16027 0.0390% Give a global variable as a parameter to a C function No
INC WORD [disp16] 06 8159 39 41.91 341920 0.8326% Increment a global variable No
PUSH [BP+disp8] 76 520954 38 38.10 19847174 48.3317% Give a local variable as a parameter to a C function No
INC WORD [BP+disp8] 46 518732 40 40.16 20831544 50.7288% Increment a local variable of a C function No

Opcode 81 only used those two variations, with only the CMP [BP+disp8],imm16 operation actually relevant. Opcode F7 used several modrm variations, but again only DIV BX is called repeatedly in the CPU speed test loop. I believe this opcode is used to determine the CPU MHz number, as the DIV opcode is supposed to take exactly 22 CPU cycles on a 80286 processor. As the division seems to take 67 ticks (at 33MHz) in DSx86, that will nicely convert to 11MHz 80286 clock speed, just like Norton SysInfo reports. Opcode FF (together with 81 and 8F) are then probably the actual opcodes used to calculate the CPU speed, and all of these use the BP-register-indexed stack access.

It is also interesting that the very rarely called operations take on the average two times the minimum timer ticks, while the common operations take around the minimum number of ticks all the time. This is probably due to NDS cache misses while the less frequent operations are performed.

Optimizing opcode 03 (ADD r16,r/m16)
Next I looked into the operation that I thought is most suitable for testing the possible optimization tricks, opcode 03 (ADD reg16, r/m16). This is a good opcode for testing, as it does not need the CPU flags to be saved (the addition will change all of them anyways), and pretty much all the arithmetic and logical opcodes are very similar. So if I can figure out ways to optimize it, I can use the same tricks for a lot of other opcodes as well. The refactored code for opcode 0346 (ADD AX,[BP+disp8]) looked like this (the actual code is full of parameterized macros, so this is what the code would look like with all the macros expanded), and it takes 28.11 ticks on the average to run:

add_ax_bpdisp8:
@-------
@ macro r0high_from_idx_disp8
@-------
ldrsb r0,[r12],#1 @ Load sign-extended byte to r0, increment r12 by 1
add r0, r9, r0, lsl #16 @ r0 = (idx register + signed offset) << 16
b add_r16_r0high_bp_r4 @ Jump to handler for AX (r4) register with BP (r9) based indexing
...
@-------
@ macro add_reg16_r0high
@ On input:
@ r0 = offset within the segment in high halfword
@ r1 = free
@ r2 = current effective segment in high halfword, segment override flag in lowest byte
@ r3 = current SS segment in high halfword, current DS segment in low halfword
@ r4..r11 = AX..DI registers in high halfwords
@ r12 = current physical CS:IP
@ lr = current physical SS:0000
@-------
add_r16_r0high_bp_r4: @ This is jumped to when the offset is based on BP register
@-------
@ macro mem_handler_bp_destroy_SZflags
@ Indexing by BP register, so use SS unless a segment override is in effect.
@-------
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
moveq r2, r3, lsr #16 @ r3 high halfword contains the SS segment, so put it into r2 ...
lsleq r2, #16 @ ... and shift it to the high halfword.
@-------
@ macro mem_handler_jump_r0high
@ Calculate the physical RAM address, and jump to correct handler
@ depending on the type of the memory addressed.
@ On input:
@ r0 = offset within the segment in high halfword
@ r2 = current effective segment in high halfword
@ NOTE! Nothing may have been pushed into stack before this!
@ Output:
@ r2 = physical memory address (with EGA/MODEX flags if applicable)
@ Destroys:
@ r0
@-------
add_r16_r0high_r4: @ This is jumped to when the offset is NOT based on BP register
add r2, r0, lsr #4 @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
mov r0, r2, lsr #(12+10+4) @ r0 = 16K page number
add r0, #(SP_EMSPAGES>>2) @ r0 = index into EMSPages table in stack
ldr r0,[sp, r0, lsl #2] @ r0 = physical start address of the page, highest byte tells type
lsl r2, #(18-12) @ r2 = offset within the 16K page in highest bits
add r2, r0, r2, lsr #18 @ r2 = physical linear address
add r0, pc, r2, lsr #24 @ r0 = PC + 0x02, 0x06, 0x0A, 0x0E, ...
ldr pc,[r0, #-2] @ Jump to the handler, adjust index to 0, 4, 8, or 12
.word .op_03_RAM_r4 @ RAM (physical address like 0x02XXXXXX)
.word .unknown_back1 @ MCGA Direct (obsolete!)
.word op_03_EGA_r4 @ EGA (physical address like 0x0AXXXXXX)
.word .unknown_back1 @ Mode-X (unsupported opcode!)
.op_03_RAM_r4:
@-------
@ Actual code for handling opcode 03 when the target is AX and the address is in normal RAM.
@ Get a halfword from (possibly) unaligned memory address, and add it to register.
@-------
ldrb r0, [r2] @ Load low byte from RAM
ldrb r1, [r2, #1] @ Load high byte from RAM
lsl r0, #16
orr r0, r1, lsl #24 @ r0 = low byte | (high byte << 8) (in high halfword)
adds r4, r0 @ Finally perform the addition
b loop

I did a minor optimization immediately, I coded a shortcut for the situation where the memory operand is in normal RAM (which it always is in SysInfo), and then checked which operations exactly are performed:

opcode 03 modrm count min ticks avg ticks total ticks % of total command ITCM?
ADD SI,AX F0 312 14 39.47 12314 0.0468% Add register AX to register SI No
ADD AX,BX C3 364 14 36.27 13201 0.0502% Add register BX to register AX No
ADD DI,[BP+disp8] 7E 208 25 67.85 14112 0.0537% Add a local variable to register DI No
ADD AX,[disp16] 06 364 24 68.85 25061 0.0953% Add a global variable to register AX No
ADD AX,[BP+disp8] 46 1045983 25 25.03 26180287 99.5774% Add a local variable to register AX No

The things to note about this opcode are:

All the time is spent practically in a single operation, adding a 16-bit value in the stack frame of the function to the AX register (which is the code shown above). All other modrm variations are called only a few hundred times, while this operation is called a million times.
My coding a shortcut for quickly handling the RAM operation decreased the number of ticks from 28.11 to 25.03, so this was a worthwhile change.
The less common operations really suffer from the cache misses, with their average ticks over two times their minimum ticks.
I wanted to make sure that the fluctuating ticks for the less common opcodes really is caused by the cache misses, so I experimented by moving all the register modrm operations (modrm >= 0xC0) into ITCM, and jumping directly to their handler if the modrm byte is >= 0xC0 (else I jump to the original handler). This made the average time for the register operations to be exactly 14 ticks, so this proved that the additional time is caused by cache misses. Sadly there is not enought ITCM to put all register operations there, so my best strategy is trying to optimize the extra jumps away from as many operations as possible.

Final result
I made several iterations, adjusting and improving the code and then profiling again. Finally I ran out of new ideas to test, so this is what the current code looks like. See the list of optimizations below the code for a description of each change I did. The changes are also marked in red in the comments in this code snippet:

add_ax_bpdisp8:
@-------
@ new macro r0high_r2_from_bpdisp8_destroy_SZflags
@-------
ldrsb r0,[r12],#1 @ Load sign-extended byte to r0, increment r12 by 1
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
add r0, r9, r0, lsl #16 @ r0 = (idx register + signed offset) << 16
biceq r2, r3, #0x0000FF00 @ r2 = logical SS segment in high halfword, with garbage in low byte
@-------
@ macro calc_linear_address_r2_from_r0high
@-------
add r2, r0, lsr #4 @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
mov r0, r2, lsr #(12+10+4) @ r0 = 16K page number
add r0, #(SP_EMSPAGES>>2) @ r0 = index into EMSPages table in stack
ldr r0,[sp, r0, lsl #2] @ r0 = physical start address of the page minus logical page start
add r2, r0, r2, lsr #12 @ r2 = physical linear address
@-------
@ Code specific to [BP+disp8] handling
@-------
tst r2, #0x7C000001 @ Is the target something else than halfword-aligned RAM?
bne .op_03_addr_r4 @ Yep, so jump there
@-------
@ Halfword-aligned RAM address accessed by BP-based indexing.
@-------
ldrh r0, [r2] @ Load halfword from RAM
adds r4, r0, lsl #16 @ Add it to register value
b loop @ Back to opcode loop

The optimizations I made to the code are the following:

I removed the jump from the add_ax_bpdisp8 opcode handling to the main add_r16_r0high_bp_r4 handler, and inlined the address calculation to the opcode handler itself. This makes the code somewhat larger, but for these most often used opcodes it is a good tradeoff.
I coded a new macro r0high_r2_from_bpdisp8_destroy_SZflags which is specific to the BP+disp8 handling. I interleaved the r0 and r2 register handling to avoid using a register immediately after it is loaded (which my original macros did), and I also realized that I can leave garbage to the low byte of r2 while making the high halfword contain the SS (Stack Segment) value. Thus I could get rid of the extra shift to clean the bottom halfword, and just use the bic command to copy the high halfword from r3 to r2 and clear the bits that would go into the top 20-bit address area. I cleared the whole second byte just for the sake of tidiness. :-)
I changed the SP_EMSPAGES table in stack to contain the physical base addresses minus the logical segment address, so I could get rid of the extra cleanup shifting after the lds operation. This change affected all of the refactored code, not just this BP+disp8 handling. However, since now I have to use the r0 register immediately after it is loaded, this did not much improve the performance, but it does make the code smaller, at least.
I added a tst r2,#0xFC000000 operation to test for the r2 register high bits (which determine the memory access mode) and only jump to the memory address dispatcher if the high bits tell that the memory address is not in RAM. For this ADD reg16,r/m16 opcode the only game I have so far run into that uses graphics memory is LHX Attack Chopper, and even that did not use the ADD AX,[BP+disp8] version, so the jump to the dispatcher is in practice never taken.
After I had added the previous test it occurred to me that the BP register is normally used to access local variables in stack, and the stack is normally halfword-aligned. So perhaps I could test the r2 address for halfword-alignment using a tst r2,#0xFC000001 operation, with no extra performance penalty! I have always wanted to use ldrh and strh instead of the byte loads/stores when the address is halfword aligned, but I have always felt that the extra test for alignment will kill the possible performance advantage, especially when the address is randomly either aligned or not. But in this situation the address is likely to be aligned, and the test would not cost anything extra in terms of CPU cycles. After some testing I noticed that tst r2,#0xFC000001 will always set the Carry flag (as the last bit rotated is set), so I changed it to tst r2,#0x7C000001 which will always clear the Carry flag, so that I can use it in logical operations (which are supposed to clear the Carry flag) as well.
Finally, after the test for halfword alignment, I could replace the two separate byte loads with a single halfword load, and perform the addition. If the address happens not to be halfword aligned (or a program adds to a register based on a value in graphics memory), the code jumps to the slow generic handler at .op_03_addr_r4.

So, after I coded similar optimizations to all the [BP+disp8] based operations that Norton SysInfo uses during the CPU speed calculation, how did this affect the speed? Here first is the new profiling result, where we can see that handling opcode 03 now takes on the average only 22.13 timer ticks (while it originally took over 28 ticks):

opcode byte count min ticks avg ticks total ticks % of total ITCM? improvement
JL 7C 518073 11 11.01 5703131 4.2727% Yes 0%
ADD r16, r/m16 03 519191 14 22.13 11490927 8.6088% No 21.25%
MOV r/m16,r16 89 525284 24 24.25 12736016 9.5416% No 34.78%
CMP [BP+disp8],imm16 81 518356 25 25.12 13019844 9.7543% No 21.85%
POP r/m16 8F 518085 28 28.06 14539076 10.8924% No 28.32%
MOV r16,r/m16 8B 806280 14 20.04 16157616 12.1050% Yes 18.64%
DIV BX (etc) F7 266636 16 67.04 17875969 13.3924% No 0%
INC/PUSH [BP+disp8] (etc) FF 1048576 27 30.25 31721006 23.7649% No 22.82%

The operations that read from RAM (now using ldrh instead of two ldrb operations) have improved about 20%. The operations that write to RAM (this time with strh instead of two strb operations) have improved by about 30%! (The real improvement is even a little bit higher, as these percentages have the profiling overhead included in the results.) It is interesting that the memory store benefits from halfword access more than the load. Perhaps this is due to my not being able to avoid using the register immediately after load, while storing a register does not have this slowdown. And finally, here is what Norton SysInfo now shows as the speed of DSx86. I was hoping I could get back to above 10x original PC speed, and I am quite happy to see that I succeeded. All in all, looks like my refactoring the code did not completely kill the performance of DSx86.

The next program I am going to profile is Wing Commander II, as it has been pretty choppy to begin with. The last time I profiled it the MCGA graphics mode only used Direct screen access, while nowadays only blitted screen update is used. Thus the results will not be fully comparable to results from last year, but even so it will give me information on what opcodes to optimize next.

http://dsx86.patrickaalto.com/DSblog.html