Thread: Daedalus R13 Progress Report

                  
   
  1. #1
    DCEmu Old Pro Wally's Avatar
    Join Date
    Oct 2005
    Posts
    1,170
    Rep Power
    75

    StrmnNrmn has posted an interesting article on his blog about R13, covering things like trampolines and some nice speedups.

    To sum it all up, he says he is seeing a 10-15% speedup over R12.

    Here's the article:

    Dynarec Improvements

    I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.

    At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)
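A minimal sketch of the kind of end-of-fragment check described above, written in Python purely for illustration (Daedalus itself emits PSP machine code for this). The names `CpuState`, `next_vbl` and `end_of_fragment_check` are hypothetical, and treating both events as deadlines on the 32-bit COUNT value is an assumption, not something the article states.

```python
MASK32 = 0xFFFFFFFF


class CpuState:
    """Hypothetical container for the timing state touched at fragment exit."""

    def __init__(self, count=0, compare=0, next_vbl=0):
        self.count = count        # emulated COUNT register
        self.compare = compare    # emulated COMPARE register
        self.next_vbl = next_vbl  # COUNT value at which the next vertical blank is due (assumed)


def end_of_fragment_check(cpu, cycles):
    """Advance COUNT by `cycles` and return the names of events that became due."""
    old = cpu.count
    cpu.count = (old + cycles) & MASK32
    due = []
    for name, deadline in (("vbl", cpu.next_vbl), ("compare", cpu.compare)):
        # Distance from the old COUNT to the deadline, allowing for 32-bit wraparound.
        distance = (deadline - old) & MASK32
        if 0 < distance <= cycles:
            due.append(name)
    return due
```

If the returned list is empty, execution can carry on into the next fragment; otherwise the dynarec would have to exit so the event can be handled.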

    On the train into work on Monday I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doubly-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.

    The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.
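The flag-versus-specialised-function idea can be sketched as follows. This is only an illustration in Python: the function names are hypothetical, and the assumption that the flag simply ends up in a `branch_delay` field of the CPU state is a guess made for the example.

```python
def handle_exit(state, in_delay_slot):
    """Generic handler: every fragment epilogue has to set up the flag argument."""
    state["branch_delay"] = in_delay_slot
    return state


def handle_exit_normal(state):
    """Specialised for exits after an ordinary instruction: no flag to pass."""
    state["branch_delay"] = False
    return state


def handle_exit_delay_slot(state):
    """Specialised for exits from a branch delay instruction: no flag to pass."""
    state["branch_delay"] = True
    return state
```

Because the code generator knows at compile time which kind of exit it is emitting, it can call the matching specialised handler directly and drop the instructions that loaded the flag.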

    These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.

    The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):



    ADDIU a0 = s1 + 0x0024 # add offset to base register
    SLT t0 = (a0<s6) # compare to upper limit
    ADDU a1 = a0 + s7 # add offset to emulated ram
    BNEL t0 != r0 --> cont # valid address?
    LW s0 <- 0x0000(a1) # load data
    J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
    NOP
    cont:
    # s0 now holds the loaded value,
    # or we've exited from dynarec with an exception


    There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.

    So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the PSP in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction (the one in the branch delay slot) is only executed if the branch is taken. If I had used a plain 'BNE', the LW 'Load Word' in the delay slot would execute even when the address was out of range, and the emulator could often crash dereferencing invalid memory.

    Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.
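The fast path and its fallback can be modelled roughly like this. It is a sketch in Python rather than generated MIPS code; `slow_path` is a hypothetical stand-in for _HandleLoadStore_XYZ123, the 4 MiB RAM size and big-endian byte order are assumptions, and the real handler does far more than raise an error.

```python
RAM_SIZE = 0x400000                  # assumed: 4 MiB, no Expansion Pak
S6 = 0x80000000 + RAM_SIZE           # upper limit, as held in s6
ram = bytearray(RAM_SIZE)            # stands in for the emulated RAM buffer


def to_signed32(x):
    """Reinterpret a 32-bit value as a signed integer, as SLT does."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def slow_path(address):
    """Hypothetical stand-in for the per-access handler (virtual memory, exceptions)."""
    raise MemoryError(f"unhandled access at {address:#010x}")


def load_word(base, offset):
    address = (base + offset) & 0xFFFFFFFF              # ADDIU
    in_range = to_signed32(address) < to_signed32(S6)   # SLT
    if in_range:                                        # BNEL taken
        index = address - 0x80000000                    # ADDU with s7, relative to the RAM base
        return int.from_bytes(ram[index:index + 4], "big")  # LW
    return slow_path(address)                           # J to the handler
```

For example, after `ram[0x24:0x28] = bytes([0xDE, 0xAD, 0xBE, 0xEF])`, calling `load_word(0x80000000, 0x24)` takes the fast path, while `load_word(0x80400000, 0)` falls through to the handler.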

    When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.
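The +/-128KiB figure follows from the MIPS branch encoding: a signed 16-bit offset counted in 4-byte instructions, relative to the address of the delay slot. A small Python check of whether a given target is reachable (the function name is made up for this example):

```python
def branch_reachable(branch_pc, target):
    """True if a MIPS conditional branch at branch_pc can encode a jump to target."""
    delta = target - (branch_pc + 4)   # offsets are relative to the delay slot
    if delta % 4:
        return False                   # targets must be word aligned
    words = delta // 4
    return -0x8000 <= words <= 0x7FFF  # signed 16-bit offset field
```

A handler living roughly 3MiB away fails this test, whereas a trampoline placed a few hundred bytes after the branch passes easily.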

    What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:


    ADDIU a0 = s1 + 0x0024 # add offset to base register
    SLT t0 = (a0<s6) # compare to upper limit
    BEQ t0 == r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
    ADDU a1 = a0 + s7 # add offset to emulated ram
    LW s0 <- 0x0000(a1) # load data
    cont:
    # s0 now holds the loaded value,
    # or we've exited from dynarec with an exception
    #
    # rest of fragment code follows
    # ...


    _Trampoline_XYZ123:
    # handler returns control to 'cont'
    J _HandleLoadStore_XYZ123
    NOP


    The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.
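One way to picture the code generator's side of this is an emitter that appends hot instructions to the main path and queues the cold J/NOP pairs until the fragment is finished. The Python below is only a sketch of that idea: `FragmentBuilder` and its methods are hypothetical, and the instruction strings simply mimic the listing above.

```python
class FragmentBuilder:
    """Hypothetical emitter that keeps hot code contiguous and defers trampolines."""

    def __init__(self):
        self.main = []         # instructions on the main path of the trace
        self.trampolines = []  # cold J/NOP pairs, emitted after the fragment body

    def emit_load(self, uid):
        self.main += [
            "ADDIU a0 = s1 + offset",
            "SLT   t0 = (a0<s6)",
            f"BEQ   t0 == r0 --> _Trampoline_{uid}",
            "ADDU  a1 = a0 + s7",   # sits in the branch delay slot
            "LW    s0 <- 0(a1)",
        ]
        self.trampolines += [
            f"_Trampoline_{uid}:",
            f"J     _HandleLoadStore_{uid}",
            "NOP",
        ]

    def finish(self):
        """Return the fragment with all trampolines placed after the main path."""
        return self.main + self.trampolines
```

Emitting two loads this way gives ten contiguous hot instructions followed by both trampolines, so the cold jumps never share cache lines with the middle of the trace.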

    As with the first optimisation, this change yielded a further 3-5% speedup.

    The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:


    8018e014 LB t8 <- 0x0000(a1)
    8018e018 LB t9 <- 0x0000(a0)
    8018e01c ADDIU a0 = a0 + 0x0001
    8018e020 XOR a2 = t8 ^ t9
    8018e024 SLTU a2 = (r0<a2)
    8018e028 BEQ a2 == r0 --> 0x8018e038
    8018e02c ADDIU a1 = a1 + 0x0001
    8018e038 LB t0 <- 0x0000(a0)
    8018e03c NOP
    8018e040 BEQ t0 == r0 --> 0x8018e058
    8018e044 NOP
    8018e048 LB t1 <- 0x0000(a1)
    8018e04c NOP
    8018e050 BNE t1 != r0 --> 0x8018e014
    8018e054 NOP


    I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.

    The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:


    enter_8018e014:
        load n64 registers into cached regs

        perform various calculations on cached regs

        if some-condition
            flush dirty cached regs back to n64 regs
            goto enter_8018e038

        perform various calculations on cached regs

        flush dirty cached regs back to n64 regs

        if ok-to-continue
            goto enter_8018e014
    exit_8018e014:
        ...

    enter_8018e038:
        ...


    The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:


    enter_8018e014:
        load n64 registers into cached regs
        mark modified regs as dirty

    loop:
        perform various calculations on cached regs

        if some-condition
            flush dirty cached regs back to n64 regs
            goto enter_8018e038

        perform various calculations on cached regs

        if ok-to-continue
            goto loop

        flush dirty cached regs back to n64 regs
    exit_8018e014:
        ...

    enter_8018e038:
        ...


    In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.
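To get a feel for the saving, here is a toy Python model that just counts register loads and flushes for the two schemes. It assumes every cached register is dirty on each pass and that the side exit to the other trace is never taken; both are simplifications for the example, and the function names are invented.

```python
def run_r12(iterations, cached_regs):
    """R12 scheme: reload on every entry, flush before every loop back."""
    loads = flushes = 0
    for _ in range(iterations):
        loads += cached_regs    # load n64 registers into cached regs
        flushes += cached_regs  # flush dirty cached regs back to n64 regs
    return loads, flushes


def run_r13(iterations, cached_regs):
    """R13 scheme: load once, loop on cached registers, flush once on exit."""
    loads = cached_regs         # loaded once, before entering the loop
    for _ in range(iterations):
        pass                    # loop body works purely on cached registers
    flushes = cached_regs       # all marked dirty up front, so all are flushed on exit
    return loads, flushes
```

With 4 cached registers and 100 iterations, the first scheme performs 400 loads and 400 flushes, while the second performs 4 of each.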

    This new method is much more efficient when it comes to handling tight inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.

    I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet.

    -StrmnNrmn

    *The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate.
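The reason a single SLT suffices is that it compares its operands as signed 32-bit values: every address from 0x80000000 upwards is negative, every address below it is non-negative, so "less than the upper limit" can only hold for addresses at or above 0x80000000. A short Python check of that equivalence (helper names are made up for the example):

```python
def to_signed32(x):
    """Reinterpret a 32-bit value as a signed integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def slt(a, b):
    """Model of MIPS SLT: 1 if a < b as signed 32-bit values, else 0."""
    return 1 if to_signed32(a) < to_signed32(b) else 0


def in_range(address, ram_size):
    """The two-sided range test from the footnote, written out in full."""
    return 0x80000000 <= address < 0x80000000 + ram_size
```

Comparing `slt(address, 0x80000000 + ram_size)` against `in_range(address, ram_size)` gives the same answer for both RAM sizes mentioned in the article.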

  2. #2
    DCEmu Legend ICE's Avatar
    Join Date
    Aug 2006
    Age
    34
    Posts
    3,697
    Blog Entries
    6
    Rep Power
    107

    NICE another 15%!

  3. #3
    DCEmu Old Pro
    Join Date
    May 2006
    Posts
    1,386
    Rep Power
    101

    He's amazing

    How does one understand all of this? It's so cool!

  4. #4
    DCEmu Coder BrooksyX's Avatar
    Join Date
    Feb 2006
    Location
    Washington, U.S.
    Age
    34
    Posts
    1,336
    Rep Power
    84

    Most of that was way over my head.

    Anyways, sounds like R13 is going to be pretty sweet. I can't wait.


  5. #5
    DCEmu Legend
    Join Date
    Sep 2006
    Location
    USA
    Posts
    2,152
    Rep Power
    75

    That's really amazing that after coding so brilliantly, he still takes the time to write lengthy blog updates to go into detail on what he has been working on.

    R13 sounds like it could be the most exciting release yet. Well, they are all exciting. But with a 10-15% speedup, and more ideas and optimizations under way, I am pretty excited.

    Thank you StrmnNrmn!

  6. #6

    Well that was completely over my head... but when StrmnNrmn talks, the homebrew scene listens. Can't wait for the next release. Keep up the great work!

  7. #7
    DCEmu Coder Safari Al's Avatar
    Join Date
    Mar 2007
    Location
    http://homebrewheaven.net
    Posts
    863
    Rep Power
    0

    StrmnNrmn sounds like the next Albert Einstein the way he talks. Nobody understands a thing but they listen. I wonder what game he is working on making compatible?

  8. #8
    DCEmu Rookie
    Join Date
    Sep 2006
    Posts
    170
    Rep Power
    65

    He must know a lot about MIPS.

  9. #9
    DCEmu Newbie
    Join Date
    Aug 2006
    Posts
    84
    Rep Power
    0

    This is fantastic news! Any kind of speed up is beautiful. I can't wait for this to come out. The best part is that he stated he had more optimizations in mind, which means more speed! This release will be exciting!

  10. #10
    DCEmu Regular jurkevicz's Avatar
    Join Date
    Aug 2006
    Posts
    389
    Rep Power
    66

    Can you imagine how Goldeneye must be going now! Hopefully it will be playable.
