News from Pate concerning his DOS emulator for the DS:

Sorry, no new release of DSx86 today, as I have only been working on DS2x86 for the past two weeks. This porting work is progressing nicely: over half of the opcodes have been ported over to MIPS ASM. I have to mention, though, that the opcodes so far have been the easy ones (except the BCD opcodes); the more difficult opcodes like the string operations, shifts, INT and IRET, and port I/O are still ahead. These will take more time, and some of them will need some interfacing to the underlying hardware, so I cannot simply port them over from the ARM ASM code.

I am currently at opcode 0x8C, which is the mov r/m16,Sreg opcode, that is, moving a value from a segment register to memory or to a register. The problem above was caused by my tester code not yet supporting the FS and GS segment registers, while the CPU emulation already does. So, every now and then I need to fix my tester program instead of the emulation code. :-)

Lazy Flags
Practically at the same time as I started porting the opcode handlers from ARM ASM to MIPS ASM, I started thinking of ways to handle the Lazy Flags with as little slowdown as possible. Yesterday I figured out a method that is a little bit faster than the one I had when I started, so I spent a couple of hours refactoring all the opcodes I had already coded to use the new method. Too bad this did not occur to me earlier, but it is to be expected that I need to recode some parts of the code several times, as I am still only learning the tricks of MIPS ASM.

I again used the DOSBox sources, together with the nice description in a blog post, to figure out how the lazy flags need to work. There are six flags that change after each arithmetic operation in the x86 architecture. Some of them are simple to determine after the operation and some are more difficult. The flags are:

Carry flag. This indicates unsigned overflow of the operation.
Adjust flag. This is similar to Carry, but for the low 4 bits of the operation.
Overflow flag. This indicates signed overflow of the operation.
Zero flag. This indicates whether the result was zero.
Sign flag. This indicates whether the result was negative.
Parity flag. This indicates whether the number of bits set in the low byte of the result is even.

The simple flags are Zero, Sign and Parity. The Zero flag is set if the result was zero, the Sign flag is set if the highest bit of the result was set, and the Parity flag can be set from a 256-item lookup table indexed by the low byte of the result. These three flags behave the same way in all opcodes (that change flags), so they can be determined simply from the result of the last operation. The other three flags behave differently in different opcodes, so based on the calculations in the DOSBox sources I compiled a list of the different cases, to see how these need to be handled. DOSBox names the result and operands lf_resd, lf_var1d and lf_var2d (for doubleword operands), and I named them lf_res, lf_val1 and lf_val2 in my code.
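
As a minimal C sketch of the three simple flags (this is not code from DS2x86 or DOSBox; the names parity_table, init_parity_table, zero_flag, sign_flag and parity_flag are made up for illustration), the result alone is enough to derive them. The sketch assumes lf_res has already been masked to the operand width and that the caller passes the sign-bit mask for that width.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 256-entry table: parity_table[b] is 1 when byte b
   has an even number of set bits (which is when x86 sets PF). */
static uint8_t parity_table[256];

static void init_parity_table(void)
{
    for (int b = 0; b < 256; b++) {
        int bits = 0;
        for (int i = 0; i < 8; i++)
            bits += (b >> i) & 1;
        parity_table[b] = (uint8_t)((bits & 1) == 0);
    }
}

/* lf_res is the remembered result, already masked to the operand width. */
static bool zero_flag(uint32_t lf_res)
{
    return lf_res == 0;
}

/* msb_mask selects the sign bit: 0x80, 0x8000 or 0x80000000. */
static bool sign_flag(uint32_t lf_res, uint32_t msb_mask)
{
    return (lf_res & msb_mask) != 0;
}

/* Only the low byte of the result matters for Parity. */
static bool parity_flag(uint32_t lf_res)
{
    return parity_table[lf_res & 0xFF] != 0;
}
```

The table has to be filled once with init_parity_table() before parity_flag() is used; after that each of the three flags costs a compare, a mask or a single table load.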


Carry flag:
Unknown, INC, DEC, MUL: return previous flag state
ADD: return (unsigned)lf_res < (unsigned)lf_val1;
ADC: return ((unsigned)lf_res < (unsigned)lf_val1) || (lflags.oldcf && (lf_res == lf_val1));
SBB: return ((unsigned)lf_val1 < (unsigned)lf_res) || (lflags.oldcf && (lf_val2 == 0xffffffff));
SUB, CMP: return ((unsigned)lf_val1 < (unsigned)lf_val2);
SHL, SHR, SAR, ROL, ROR, RCL, RCR: All have different handling
NEG: return lf_val1;
OR, AND, XOR, TEST, DIV: return false;
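
To see why the ADD, ADC, SUB and SBB formulas work, here is a small C sketch that writes them out for 32-bit operands next to a reference that simply does the arithmetic in 64 bits. The function names are hypothetical and the reference functions are only an illustration, not something DOSBox or DS2x86 uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lazy formulas from the list, for 32-bit operands.
   oldcf is the Carry flag value before the operation. */
static bool carry_add(uint32_t lf_val1, uint32_t lf_res)
{
    return lf_res < lf_val1;
}

static bool carry_adc(uint32_t lf_val1, uint32_t lf_res, bool oldcf)
{
    return (lf_res < lf_val1) || (oldcf && lf_res == lf_val1);
}

static bool carry_sub(uint32_t lf_val1, uint32_t lf_val2)
{
    return lf_val1 < lf_val2;
}

static bool carry_sbb(uint32_t lf_val1, uint32_t lf_val2,
                      uint32_t lf_res, bool oldcf)
{
    return (lf_val1 < lf_res) || (oldcf && lf_val2 == 0xffffffffu);
}

/* Reference: add in 64 bits and look at bit 32. */
static bool carry_adc_wide(uint32_t a, uint32_t b, bool cf)
{
    return (((uint64_t)a + b + (cf ? 1u : 0u)) >> 32) != 0;
}

/* Reference: a borrow happens when a < b + cf, computed without wrapping. */
static bool borrow_sbb_wide(uint32_t a, uint32_t b, bool cf)
{
    return (uint64_t)a < (uint64_t)b + (cf ? 1u : 0u);
}
```

The extra term in ADC and SBB covers the one case the plain comparison misses: when the right operand is 0xffffffff and the incoming Carry is set, the result wraps all the way around to lf_val1, so the comparison alone would report no carry.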


Adjust flag:
Unknown: return previous flag state
ADC, ADD, SBB, SUB, CMP: return ((lf_val1 ^ lf_val2) ^ lf_res) & 0x10;
INC: return (lf_res & 0x0f) == 0;
DEC: return (lf_res & 0x0f) == 0x0f;
NEG: return lf_val1 & 0x0f;
SHL, SHR, SAR: return lf_val2 & 0x1f;
OR, AND, XOR, TEST, DIV, MUL: return false;


Overflow flag:
Unknown, MUL: return previous flag state
ADD, ADC: return ((lf_val1 ^ lf_val2 ^ 0x80000000) & (lf_res ^ lf_val2)) & 0x80000000;
SBB, SUB, CMP: return ((lf_val1 ^ lf_val2) & (lf_val1 ^ lf_res)) & 0x80000000;
INC: return (lf_res == 0x80000000);
DEC: return (lf_res == 0x7fffffff);
NEG: return (lf_val1 == 0x80000000);
SHL: return (lf_res ^ lf_val1) & 0x80000000;
SHR: if ((lf_val2&0x1f)==1) return (lf_val1 >= 0x80000000); else return false;
OR, AND, XOR, TEST, SAR, DIV: return false;
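
The lazy idea behind these lists can be sketched in C as follows: executing an opcode only records the operation type, the operands and the result, and the flag is worked out later if something asks for it. This is a simplified illustration for 32-bit ADD and SUB style operations only. The lf_op tags, the lazy_flags struct and the record_add / record_sub helpers are assumptions for the example, not the DOSBox or DS2x86 data layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation tags for the example. */
typedef enum { LF_ADD, LF_SUB, LF_INC, LF_DEC, LF_NEG, LF_LOGIC } lf_op;

typedef struct {
    lf_op    type;     /* last flag-changing operation */
    uint32_t lf_val1;  /* left operand */
    uint32_t lf_val2;  /* right operand */
    uint32_t lf_res;   /* result */
} lazy_flags;

/* Executing an opcode only records the inputs and the result. */
static uint32_t record_add(lazy_flags *lf, uint32_t a, uint32_t b)
{
    lf->type = LF_ADD;
    lf->lf_val1 = a;
    lf->lf_val2 = b;
    lf->lf_res = a + b;
    return lf->lf_res;
}

static uint32_t record_sub(lazy_flags *lf, uint32_t a, uint32_t b)
{
    lf->type = LF_SUB;
    lf->lf_val1 = a;
    lf->lf_val2 = b;
    lf->lf_res = a - b;
    return lf->lf_res;
}

/* The Overflow flag is computed only when it is actually needed. */
static bool overflow_flag(const lazy_flags *lf)
{
    const uint32_t msb = 0x80000000u;
    switch (lf->type) {
    case LF_ADD:
        return (((lf->lf_val1 ^ lf->lf_val2 ^ msb) &
                 (lf->lf_res ^ lf->lf_val2)) & msb) != 0;
    case LF_SUB:
        return (((lf->lf_val1 ^ lf->lf_val2) &
                 (lf->lf_val1 ^ lf->lf_res)) & msb) != 0;
    case LF_INC: return lf->lf_res == 0x80000000u;
    case LF_DEC: return lf->lf_res == 0x7fffffffu;
    case LF_NEG: return lf->lf_val1 == 0x80000000u;
    default:     return false;  /* OR, AND, XOR, TEST and similar */
    }
}
```

For example, after record_add(&lf, 0x7fffffff, 1) the result is 0x80000000 and overflow_flag(&lf) reports a signed overflow, while the add itself paid only for four stores.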

Based on these lists, it seemed to me that the Carry flag would be the most difficult and time-consuming to calculate. Besides the obvious conditional jump opcodes, there are many other opcodes (ADC, SBB, RCL, RCR, CMC) that need the current Carry flag value as their input. The shift opcodes also change and use the Carry flag in various ways, so it seemed to me that using switch-statement-style code to calculate the Carry flag lazily whenever it is needed would really slow down those operations. So, I decided to see how much extra code I would need if I went for a direct Carry flag calculation in each of the opcodes. It turned out that most of the time it only takes one ASM operation to calculate the Carry flag after the operation, so this is how I currently handle the Carry flag.

I also noticed that if I calculate the Carry flag separately, I can fake the lf_val1 and lf_val2 values in opcodes like INC and DEC to give me the correct Adjust flag value when using the same calculation code as the normal ADD/SUB opcodes use. So I was able to simplify the Adjust flag calculation to a single case: ((lf_val1 ^ lf_val2) ^ lf_res) & 0x10. This left just the Overflow flag, which needs separate cases for each opcode type. I use one of the MIPS general purpose registers to keep track of the last opcode type, along with registers for the last result and operands, so that the Overflow flag can be calculated lazily whenever needed. I hope to figure out some speedups for this as well, but for now it will have to do.
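
The exact values used to fake the operands are not spelled out here. One plausible choice, assumed in the C sketch below, is to record the old value as lf_val1 and the constant 1 as lf_val2. The sketch checks that with this choice the single ADD/SUB formula agrees with the separate INC and DEC rules from the Adjust flag list for every 8-bit value. All function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The single Adjust flag formula shared by all opcodes. */
static bool adjust_flag(uint32_t lf_val1, uint32_t lf_val2, uint32_t lf_res)
{
    return (((lf_val1 ^ lf_val2) ^ lf_res) & 0x10) != 0;
}

/* The separate per-opcode rules from the list. */
static bool adjust_inc_rule(uint32_t lf_res) { return (lf_res & 0x0f) == 0; }
static bool adjust_dec_rule(uint32_t lf_res) { return (lf_res & 0x0f) == 0x0f; }

/* Assumption: INC and DEC record lf_val1 = old value and lf_val2 = 1.
   Returns how many 8-bit inputs make the two approaches disagree. */
static unsigned count_adjust_mismatches(void)
{
    unsigned mismatches = 0;
    for (uint32_t x = 0; x < 256; x++) {
        uint32_t inc_res = (x + 1) & 0xFF;
        uint32_t dec_res = (x - 1) & 0xFF;
        if (adjust_flag(x, 1, inc_res) != adjust_inc_rule(inc_res))
            mismatches++;
        if (adjust_flag(x, 1, dec_res) != adjust_dec_rule(dec_res))
            mismatches++;
    }
    return mismatches;
}
```

The formula works because bit 4 of the result differs from bit 4 of lf_val1 ^ lf_val2 exactly when a carry or borrow crossed from the low nibble, which is what the INC and DEC rules test for in a different way.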

To show an example of the actual opcode handling and what the Lazy Flag handling requires, here is the handler for the ADC r/m8,r8 opcode when the left operand is a memory address. In DS2x86 I decided to have #defines for all the registers I use for emulation, so I don't need to remember which MIPS register was which. I did not do this in DSx86, and that caused some wrong register usage from time to time.

.macro adc_effseg_reg8l reg
    get_CF_into t3                  // t3 = Carry flag value
    li lf_type, OF_CALC_ADD | 24    // Remember the operation type and shift value for Lazy Flags
    lbu lf_val1, 0(eff_seg)         // Load the left operand from RAM
    andi lf_val2, \reg, 0xFF        // Remember the right operand for Lazy Flags
    addu t3, lf_val1                // t3 = lf_val1 + Carry
    addu lf_res, t3, lf_val2        // lf_res = lf_val1 + Carry + lf_val2
    srl t0, lf_res, 8               // t0 = Carry value
    sb lf_res, 0(eff_seg)           // Save the result to RAM
    andi lf_res, 0xFF               // Remember only the low 8 bits for Lazy Flags
    j set_carry_from_t0             // Back to loop
.endm

The get_CF_into macro looks like the following. It is a macro so that I can later change how the Carry flag is calculated without having to change all the code that uses it (just in case I still need to revert to lazy calculation of the Carry flag). The set_carry_from_t0 code is immediately before the opcode loop handler, as many opcodes jump there to store the t0 register value back into the lowest bit of the flags register. When calculating the Carry flag immediately, Carry is simply bit 8 of the result (the first bit above the low byte), so I can just shift it to the lowest bit of the t0 register and don't need to handle the complex ((unsigned)lf_res < (unsigned)lf_val1) || (lflags.oldcf && (lf_res == lf_val1)) algorithm at all!
.macro get_CF_into reg
    andi \reg, flags, 1
.endm

As you can see from this code, even just remembering the result and operands for later calculation of the Lazy Flags takes a lot of code: in this case 4 of the 10 ASM operations are there just to make the later flags calculation give the correct result. When coding in ARM ASM I did not need any of these, as the ARM can keep track of the flags by itself. Thus, DS2x86 will not be as much faster than DSx86 as the difference in the CPU clock speeds would make you think.
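
For readers who do not read MIPS ASM, the ADC handler above can be modelled roughly in C as below. This is only a sketch under assumptions: the cpu_state struct, the function names, and the way set_carry_from_t0 writes bit 0 of the flags are guesses for illustration, and the lf_type bookkeeping is left out. The second function checks, for every 8-bit operand pair and both Carry values, that the immediately calculated bit 8 matches the lazy ADC formula.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical emulator state; the real code keeps these in MIPS registers. */
typedef struct {
    uint32_t flags;    /* bit 0 = Carry */
    uint32_t lf_val1;  /* left operand for Lazy Flags */
    uint32_t lf_val2;  /* right operand for Lazy Flags */
    uint32_t lf_res;   /* result for Lazy Flags */
} cpu_state;

/* Rough C model of ADC r/m8,r8 with a memory left operand. */
static void adc8_mem_reg(cpu_state *cpu, uint8_t *mem, uint8_t reg)
{
    uint32_t t3 = cpu->flags & 1u;         /* get_CF_into t3 */
    cpu->lf_val1 = *mem;                   /* lbu: load left operand */
    cpu->lf_val2 = reg;                    /* andi: right operand, 8 bits */
    t3 += cpu->lf_val1;                    /* lf_val1 + Carry */
    cpu->lf_res = t3 + cpu->lf_val2;       /* at most 511, so 9 bits */
    uint32_t t0 = cpu->lf_res >> 8;        /* srl: bit 8 is the new Carry */
    *mem = (uint8_t)cpu->lf_res;           /* sb: store the low byte */
    cpu->lf_res &= 0xFFu;                  /* keep only 8 bits for Lazy Flags */
    cpu->flags = (cpu->flags & ~1u) | t0;  /* assumed set_carry_from_t0 */
}

/* Compare the immediate Carry with the 8-bit form of the lazy ADC formula. */
static unsigned count_adc8_mismatches(void)
{
    unsigned mismatches = 0;
    for (unsigned cf = 0; cf < 2; cf++)
        for (unsigned a = 0; a < 256; a++)
            for (unsigned b = 0; b < 256; b++) {
                cpu_state cpu = { .flags = cf };
                uint8_t mem = (uint8_t)a;
                adc8_mem_reg(&cpu, &mem, (uint8_t)b);
                bool lazy = (cpu.lf_res < cpu.lf_val1) ||
                            (cf && cpu.lf_res == cpu.lf_val1);
                if (lazy != ((cpu.flags & 1u) != 0))
                    mismatches++;
            }
    return mismatches;
}
```

Because the operands are zero-extended bytes held in 32-bit registers, the sum never exceeds 511, so the shifted value is always 0 or 1 and can be stored into the flags directly.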