Pate has posted more news of his work on a Dos emulator for the DS
Internal refactoring done
This Wednesday I finally got all the internal refactoring done, and was able to do the remaining big architectural changes (that needed all the code to be refactored before I could do them). It took me a while to get the code working again, but by Wednesday evening the code finally seemed to run properly. I checked the CPU speed with Norton SysInfo to see how much the code has slowed down, and the result was pretty worrying.
I decided to see if I could include my old profiling tool (which I had last used in August last year, when the emulation core still was bundled with Wing Commander II files and could not even run 4DOS yet), and make it run with the current DSx86. I had long since stripped the code out from DSx86, but luckily I found it in an old source code backup directory. I added it back, and decided to start by profiling Norton Sysinfo.
Profiling Norton Sysinfo
Initial profiling results
Here is the first profiling result, while running the SysInfo CPU speed test. This was taken pretty much right after I finally got the new code to run properly, without any optimizing done yet. The first table shows the opcodes taking the least average number of ticks (ordered by that value), and the second table shows the opcodes taking the most total number of ticks (again ordered by that value):
opcode byte count min ticks avg ticks total ticks % of total command in ITCM?
NOP 90 1742 8 8.00 13936 0.0082% No operation Yes
CWD 99 78 10 10.00 780 0.0005% Convert word to doubleword Yes
JA 77 4394 10 10.01 43992 0.0260% Jump if unsigned above Yes
MOV DL,imm8 B2 52 11 11.00 572 0.0003% Move imm8 byte to DL register Yes
JL 7C 518056 11 11.01 5704348 3.3762% Jump if signed less Yes
opcode byte count min ticks avg ticks total ticks % of total command in ITCM?
JL 7C 518056 11 11.01 5704348 3.3762% Jump if signed less Yes
ADD r16, r/m16 03 519175 14 28.11 14591630 8.6363% Add 16-bit register or memory to 16-bit register No
ADD/SUB/AND/OR/CMP r/m16,imm16 81 518383 32 32.14 16660587 9.8608% Various 16-bit arithmetic/logical operations with imm16 value No
TEST/NOT/NEG/MUL/DIV r/m16 F7 266630 16 67.06 17880567 10.5829% Various 16-bit memory operations No
MOV r/m16,r16 89 525262 33 37.18 19526937 11.5573% Store a 16-bit register into register or memory No
MOV r16,r/m16 8B 806241 14 24.63 19859369 11.7541% Load a 16-bit register from register or memory Yes
POP r/m16 8F 518064 39 39.16 20285222 12.0061% Pop a value from stack to register or memory No
INC/DEC/CALL/JMP/PUSH r/m16 FF 1048576 34 39.20 41102557 24.3272% Various 16-bit memory operations No
Not surprisingly, NOP (no operation) is the fastest opcode. The ticks run at 33MHz, so 8 ticks means that handling a NOP opcode takes 16 CPU cycles (as the NDS CPU runs at 66MHz). This includes some profiling overhead, so one or two ticks can in effect be decremented from the ticks of all the opcodes to calculate the actual amount of timer ticks the opcode executing takes. The JL opcode is both one of the fastest opcodes and also one of the most frequently executed opcodes. It is interesting that the two most common opcodes are 0xFF and 0x8F, both of which should be rather uncommon in normal programs, especially in games. As opcodes 0x81, 0xF7 and 0xFF can perform several different operations, depending on the so called "modrm" byte following the main opcode byte, I wanted to see what operations exactly SysInfo does with those opcodes:
opcode 81 modrm count min ticks avg ticks total ticks % of total command ITCM?
CMP [disp16],imm16 3E 577 32 95.10 54873 0.1678% Compare global variable with 16-bit value No
CMP [BP+disp8],imm16 7E 1047999 32 32.10 33640850 99.8322% Compare local variable with 16-bit value No
opcode F7 modrm count min ticks avg ticks total ticks % of total command ITCM?
DIV [BP+disp8] 76 204 78 148.66 30326 0.0431% Divide DX:AX by local variable No
IMUL [BP+disp8] 6E 714 29 77.13 55073 0.0783% DX:AX = AX * local variable No
DIV CX F1 2244 67 83.05 186355 0.2651% Divide DX:AX by CX No
DIV BX F3 1043068 67 67.01 69896106 99.4304% Divide DX:AX by BX No
opcode FF modrm count min ticks avg ticks total ticks % of total command ITCM?
PUSH [disp16] 36 225 35 71.23 16027 0.0390% Give a global variable as a parameter to a C function No
INC WORD [disp16] 06 8159 39 41.91 341920 0.8326% Increment a global variable No
PUSH [BP+disp8] 76 520954 38 38.10 19847174 48.3317% Give a local variable as a parameter to a C function No
INC WORD [BP+disp8] 46 518732 40 40.16 20831544 50.7288% Increment a local variable of a C function No
Opcode 81 only used those two variations, with only the CMP [BP+disp8],imm16 operation actually relevant. Opcode F7 used several modrm variations, but again only DIV BX is called repeatedly in the CPU speed test loop. I believe this opcode is used to determine the CPU MHz number, as the DIV opcode is supposed to take exactly 22 CPU cycles on a 80286 processor. As the division seems to take 67 ticks (at 33MHz) in DSx86, that will nicely convert to 11MHz 80286 clock speed, just like Norton SysInfo reports. Opcode FF (together with 81 and 8F) are then probably the actual opcodes used to calculate the CPU speed, and all of these use the BP-register-indexed stack access.
It is also interesting that the very rarely called operations take on the average two times the minimum timer ticks, while the common operations take around the minimum number of ticks all the time. This is probably due to NDS cache misses while the less frequent operations are performed.
Optimizing opcode 03 (ADD r16,r/m16)
Next I looked into the operation that I thought is most suitable for testing the possible optimization tricks, opcode 03 (ADD reg16, r/m16). This is a good opcode for testing, as it does not need the CPU flags to be saved (the addition will change all of them anyways), and pretty much all the arithmetic and logical opcodes are very similar. So if I can figure out ways to optimize it, I can use the same tricks for a lot of other opcodes as well. The refactored code for opcode 0346 (ADD AX,[BP+disp8]) looked like this (the actual code is full of parameterized macros, so this is what the code would look like with all the macros expanded), and it takes 28.11 ticks on the average to run:
add_ax_bpdisp8:
@-------
@ macro r0high_from_idx_disp8
@-------
ldrsb r0,[r12],#1 @ Load sign-extended byte to r0, increment r12 by 1
add r0, r9, r0, lsl #16 @ r0 = (idx register + signed offset) << 16
b add_r16_r0high_bp_r4 @ Jump to handler for AX (r4) register with BP (r9) based indexing
...
@-------
@ macro add_reg16_r0high
@ On input:
@ r0 = offset within the segment in high halfword
@ r1 = free
@ r2 = current effective segment in high halfword, segment override flag in lowest byte
@ r3 = current SS segment in high halfword, current DS segment in low halfword
@ r4..r11 = AX..DI registers in high halfwords
@ r12 = current physical CS:IP
@ lr = current physical SS:0000
@-------
add_r16_r0high_bp_r4: @ This is jumped to when the offset is based on BP register
@-------
@ macro mem_handler_bp_destroy_SZflags
@ Indexing by BP register, so use SS unless a segment override is in effect.
@-------
tst r2, #0xFF @ Is a segment override in effect? Zero flag will be set if not
moveq r2, r3, lsr #16 @ r3 high halfword contains the SS segment, so put it into r2 ...
lsleq r2, #16 @ ... and shift it to the high halfword.
@-------
@ macro mem_handler_jump_r0high
@ Calculate the physical RAM address, and jump to correct handler
@ depending on the type of the memory addressed.
@ On input:
@ r0 = offset within the segment in high halfword
@ r2 = current effective segment in high halfword
@ NOTE! Nothing may have been pushed into stack before this!
@ Output:
@ r2 = physical memory address (with EGA/MODEX flags if applicable)
@ Destroys:
@ r0
@-------
add_r16_r0high_r4: @ This is jumped to when the offset is NOT based on BP register
add r2, r0, lsr #4 @ r2 = full logical linear memory address in highest 20 bits, garbage in low byte
mov r0, r2, lsr #(12+10+4) @ r0 = 16K page number
add r0, #(SP_EMSPAGES>>2) @ r0 = index into EMSPages table in stack
ldr r0,[sp, r0, lsl #2] @ r0 = physical start address of the page, highest byte tells type
lsl r2, #(18-12) @ r2 = offset within the 16K page in highest bits
add r2, r0, r2, lsr #18 @ r2 = physical linear address
add r0, pc, r2, lsr #24 @ r0 = PC + 0x02, 0x06, 0x0A, 0x0E, ...
ldr pc,[r0, #-2] @ Jump to the handler, adjust index to 0, 4, 8, or 12
.word .op_03_RAM_r4 @ RAM (physical address like 0x02XXXXXX)
.word .unknown_back1 @ MCGA Direct (obsolete!)
.word op_03_EGA_r4 @ EGA (physical address like 0x0AXXXXXX)
.word .unknown_back1 @ Mode-X (unsupported opcode!)
.op_03_RAM_r4:
@-------
@ Actual code for handling opcode 03 when the target is AX and the address
...
Catherine: Full Body’s English translation for the Vita