[WIP] YAPSxP: Yet Another PSX Emulator for PSP

**mavsman4457** · November 5th, 2006, 23:30

Originally Posted by Veskgar

Great news. I think it's fair to say that PS1 emulation is the HOLY GRAIL of homebrew. I'm glad that there are so many W.I.P. projects. The homebrew scene is bound to give SONY a run for its money someday when it comes to PS1 emulation.

An Emulator written from scratch would be fantastic.

For now a PS1 emu is the holy grail, but eventually it will be all about devhook. Future firmware updates will give us many things we never saw coming to a PSP, especially if you own a PS3.

Originally Posted by hlide

Okay, my translation must suck a lot but it is very late now, so be indulgent.

Good night !

Nice to see that you have joined the DCemu network. Good luck with your emulator. It seems that you will be taking the Exophase path by starting from scratch and reaping the benefits. Good luck and never give up.

**Exophase** · November 6th, 2006, 01:05

Why hlide is awesome:

He actually posts on boards. And with a lot of technical information about his project.

About PCSX using doubles for GTE... that's pretty sad. Back in the day PS1 emulators were expected to use 64-bit ints for GTE, because it is after all fixed point. The better emulators did this, but the open ones didn't seem to. I remember trying to convert some to this (one of them may not have been open source, I don't remember). Anyway, 64-bit ints are necessary for full precision, but I have been told that the VFPU's precision is sufficient. Which is good, because MIPS is almost as bad with 64-bit ints as it is with 64-bit floats. VFPU is 32-bit float, but the intermediate calculations (multiplies + adds) are probably handled with more precision than this.

I assume the "core 0" that exists right now is a threaded interpreter. Turning that into a "full" dynarec shouldn't be too hard (path I took with gpSP). Fortunately MIPS opcodes are easy to emit.

Hopefully GPU -> GU will go down well. I don't know all of the specifics but I bet PSP is up for the task. It has all of the basic OpenGL features, plus (as far as I'm aware) the ability to take textures from anywhere in VRAM and hopefully the ability to render outside the framebuffer too. If it has that, I bet there won't be problems mapping the GPU straight to it. Unfortunately PSP doesn't have a lot of VRAM so you can't really enhance the PS1 display much, but it'd be great if you could at least store PS1 VRAM completely in PSP VRAM with an exact 1:1 mapping. It'd be very fast and accurate.

**pkmaximum** · November 6th, 2006, 01:21

I'll believe it when I see it.

**Exophase** · November 6th, 2006, 02:00

Originally Posted by pkmaximum

I'll believe it when I see it.

You doubt everything, don't you..? It's not like the guy said he has a perfectly functional PS1 emulator and is sitting on it. Just that he's in the process of writing one.

This scene has a lot of people who believe everything or believe nothing when it comes to upcoming releases, unfortunately quzar is right about this... that belief or disbelief is almost never based on any evidence or even common sense.

**hlide** · November 6th, 2006, 02:01

Originally Posted by Exophase

Why hlide is awesome:

He actually posts on boards. And with a lot of technical information about his project.

About PCSX using doubles for GTE... that's pretty sad. Back in the day PS1 emulators were expected to use 64-bit ints for GTE, because it is after all fixed point. The better emulators did this, but the open ones didn't seem to. I remember trying to convert some to this (one of them may not have been open source, I don't remember). Anyway, 64-bit ints are necessary for full precision, but I have been told that the VFPU's precision is sufficient. Which is good, because MIPS is almost as bad with 64-bit ints as it is with 64-bit floats. VFPU is 32-bit float, but the intermediate calculations (multiplies + adds) are probably handled with more precision than this.

I assume the "core 0" that exists right now is a threaded interpreter. Turning that into a "full" dynarec shouldn't be too hard (path I took with gpSP). Fortunately MIPS opcodes are easy to emit.

Hopefully GPU -> GU will go down well. I don't know all of the specifics but I bet PSP is up for the task. It has all of the basic OpenGL features, plus (as far as I'm aware) the ability to take textures from anywhere in VRAM and hopefully the ability to render outside the framebuffer too. If it has that, I bet there won't be problems mapping the GPU straight to it. Unfortunately PSP doesn't have a lot of VRAM so you can't really enhance the PS1 display much, but it'd be great if you could at least store PS1 VRAM completely in PSP VRAM with an exact 1:1 mapping. It'd be very fast and accurate.

1) GTE: an example says it best, so take "rtps".
It takes a rotation matrix, multiplies a vector by it, translates the resulting vector, then projects it to 2D. The matrix-by-vector multiplication can be done entirely with a single VFPU instruction without loss of precision. But the addition of the translation vector can overflow, so you just need to convert the result vector to integers and then do a 64-bit addition with the translation vector (which shouldn't take more than 4 or 5 instructions). Anyway, I need to get the results back as integers so I can set the GTE FLAG register on overflow as a real GTE would. I wouldn't expect a real speedup, but it shouldn't be worse than using doubles. Another possibility is to work on 64-bit integers and use "madd" instructions, but that needs inline assembly as well, since gcc doesn't seem to generate them implicitly :/

2) CORE0 generates a sequence of instructions "terminating" with a "jr $ra" instruction, so it returns to the dispatcher after each "recompiled" instruction. A further step would simply be to remove this instruction so that we can execute the longest sequence possible. Another is to translate jumps and calls where possible, which can be done in two ways: keep a map of source and target addresses to patch at the end of recompilation and before execution, or insert a temporary jump to a function that patches this jump with the right address. Previously I made some attempts with a very simple recursive recompile function that works well with my small PSX-like test code, but I can imagine that with very large code you may exceed the PSP stack and crash :/.

3) I'm working on GPU->GU. The best thing to do is to use the same operation on GU if it exists. I suppose that if I could find the source of a good OpenGL GPU plugin (I found two but they seem incomplete), it would help me avoid some pitfalls and learn about some hacks. It's my current priority: to have a working GPU->GU with an exact 1:1 mapping. I think we can do it, though I probably need to dig more into GU and OpenGL since they're not something I'm used to.
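The first option in point 2 (keep a map of source and target addresses and patch once recompilation is done) could be sketched roughly as below. This is only an illustration under assumptions: `PatchTable`, `record_block`, `record_jump` and `resolve_patches` are hypothetical names, not YAPSxP code, and the lookup is a plain linear search. The only hardware fact relied on is the standard MIPS `j` encoding (opcode 2, 26-bit word address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PATCHES 64
#define MAX_BLOCKS  64

/* Hypothetical bookkeeping, not YAPSxP's actual structures. */
typedef struct { uint32_t guest_pc; uint32_t host_addr; } BlockEntry;
typedef struct { uint32_t *site; uint32_t guest_target; } Patch;

typedef struct {
    BlockEntry blocks[MAX_BLOCKS];
    size_t     nblocks;
    Patch      patches[MAX_PATCHES];
    size_t     npatches;
} PatchTable;

/* MIPS "j target": opcode 2 in the top 6 bits, word address in the low 26. */
static uint32_t encode_j(uint32_t host_addr)
{
    return (2u << 26) | ((host_addr >> 2) & 0x03FFFFFFu);
}

/* Remember where the recompiled code for a guest address starts. */
static void record_block(PatchTable *t, uint32_t guest_pc, uint32_t host_addr)
{
    assert(t->nblocks < MAX_BLOCKS);
    t->blocks[t->nblocks++] = (BlockEntry){ guest_pc, host_addr };
}

/* Remember an emitted jump whose host target is not known yet. */
static void record_jump(PatchTable *t, uint32_t *site, uint32_t guest_target)
{
    assert(t->npatches < MAX_PATCHES);
    t->patches[t->npatches++] = (Patch){ site, guest_target };
}

/* After recompilation, rewrite every recorded jump whose target now exists.
   Returns the number of jumps left unresolved. */
static size_t resolve_patches(PatchTable *t)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < t->npatches; i++) {
        int found = 0;
        for (size_t b = 0; b < t->nblocks; b++) {
            if (t->blocks[b].guest_pc == t->patches[i].guest_target) {
                *t->patches[i].site = encode_j(t->blocks[b].host_addr);
                found = 1;
                break;
            }
        }
        if (!found)
            t->patches[unresolved++] = t->patches[i]; /* keep for later */
    }
    t->npatches = unresolved;
    return unresolved;
}
```

Because nothing here recurses, stack use stays flat however large the guest code is. On real hardware the patched words would also have to be made visible to the instruction cache before they are executed.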

Oh my god, I would like to sleep but I cannot :/

**Exophase** · November 6th, 2006, 02:19

This is a bit far off, but if you ever get recompiled GTE working you can use liveness analysis to reduce the flag generation. You'd probably eliminate most flags this way since they're not often used. Course, you could also cache the GTE registers in VFPU ones this way.

The overflow is for 44 bit fixed point, right? I don't really understand why it can't overflow intermediately, for each of the multiplies + adds in the dot products, you can have results much larger than 32bits. I imagine that the calculations are all 44bit internally on PS1.. maybe PSP's VFPU has a lot of internal precision and overflow flags too?

Also, even with enough bits you don't necessarily have the exact precision to represent 32bit fixed point in 32bit floating point. Hopefully what you do have is "close enough."
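A quick way to see the precision limit mentioned here: a 32-bit IEEE-754 float carries a 24-bit significand, so integers above 2^24 do not all survive a round trip. The sketch below is a host-side illustration only (the helper names are made up, and it assumes the machine running it uses IEEE-754 single precision); it says nothing about how the VFPU handles intermediates.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a 32-bit integer survives int -> float -> int unchanged. */
static int roundtrips_through_float(int32_t v)
{
    volatile float f = (float)v;   /* volatile: force a real 32-bit store */
    return (int64_t)f == (int64_t)v;
}

/* Usage: 2^24 is still exact, 2^24 + 1 is the first integer that is not. */
static void float_precision_demo(void)
{
    assert(roundtrips_through_float(1 << 24));
    assert(!roundtrips_through_float((1 << 24) + 1));
}
```

So a fixed-point value that needs more than 24 significant bits loses its low bits as soon as it is stored as a float.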

I think there is actually an option to tell GCC to generate MADDs. It usually avoids them because they're usually not worth it (because of having to play with the hi/lo registers), only really when doing exactly these vector operations. But if precision/flags are not an issue there's no way you'd get anything near the performance with madd as you would with VFPU.

**hlide** · November 6th, 2006, 02:45

Originally Posted by Exophase

This is a bit far off, but if you ever get recompiled GTE working you can use liveness analysis to reduce the flag generation. You'd probably eliminate most flags this way since they're not often used. Course, you could also cache the GTE registers in VFPU ones this way.

The overflow is for 44 bit fixed point, right? I don't really understand why it can't overflow intermediately, for each of the multiplies + adds in the dot products, you can have results much larger than 32bits. I imagine that the calculations are all 44bit internally on PS1.. maybe PSP's VFPU has a lot of internal precision and overflow flags too?

Also, even with enough bits you don't necessarily have the exact precision to represent 32bit fixed point in 32bit floating point. Hopefully what you do have is "close enough."

I think there is actually an option to tell GCC to generate MADDs. It usually avoids them because they're usually not worth it (because of having to play with the hi/lo registers), only really when doing exactly these vector operations. But if precision/flags are not an issue there's no way you'd get anything near the performance with madd as you would with VFPU.

1) I do map the GTE registers onto the VFPU, in integer form, across 4 matrices. To calculate, I convert them to float with the needed precision, do the float operations, then convert them back to integers. The VFPU being undocumented, I don't know how to get the flags directly from floats, and for MAC0/1/2/3 you cannot do it in float because of loss of precision: IR0/IR1/IR2/IR3 are truncated parts of MAC0/1/2/3, so you would lose the least significant bits in IR0/IR1/IR2/IR3 and get different behavior from the real hardware.

EDIT: I'm probably wrong, since IR0/1/2/3 keep the least significant bits and on overflow they are clamped values of MAC0/1/2/3. It is only MAC0/1/2/3 that would be different, I guess. I may need to rethink...

The reason I cannot map the GTE registers in float form is that some GTE instructions expect certain input registers to be in one of two representations, 1:31:0 or 1:19:12 (outer product op0/op12 for instance), so you would lose precision from the beginning!

2) I think "madd" can be put to good use for 64-bit integer calculations, as it multiplies two integers and adds the product to the 64-bit hi/lo pair in one instruction (but I haven't checked its cycle count, that's true). You simply need to set the lo/hi registers at the beginning and read back the lo register at the end, after three "madd"s, to calculate rx = vx*r11 + vy*r12 + vz*r13 + trx for instance. Here, just set lo/hi to trx and then use 3 "madd"s to get rx back from the lo register. It doesn't look bad to me, or am I wrong?
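In plain C, the madd idea from point 2 amounts to a 64-bit accumulator seeded with the translation term. The sketch below is an assumption-laden illustration, not a verified GTE model: `mac_row` and `MacResult` are hypothetical names, the 12 fractional bits come from the 1:19:12 format mentioned above, the 44-bit limit comes from the question earlier in the thread, and overflow is tested only on the final sum rather than after each step.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 12   /* assumed: matrix entries carry 12 fractional bits */
#define ACC_BITS  44   /* assumed accumulator width, per the thread */

typedef struct {
    int32_t value;     /* result after dropping the fractional bits */
    int     overflow;  /* 1 if the sum left the 44-bit signed range */
} MacResult;

/* rx = vx*r11 + vy*r12 + vz*r13 + trx, accumulated in 64 bits. */
static MacResult mac_row(int16_t r1, int16_t r2, int16_t r3,
                         int16_t vx, int16_t vy, int16_t vz, int32_t trx)
{
    const int64_t limit = (int64_t)1 << (ACC_BITS - 1);
    /* Seed the accumulator (the hi/lo pair) with trx scaled to match. */
    int64_t acc = (int64_t)trx * ((int64_t)1 << FRAC_BITS);
    acc += (int64_t)r1 * vx;   /* each line corresponds to one madd */
    acc += (int64_t)r2 * vy;
    acc += (int64_t)r3 * vz;

    MacResult out;
    out.overflow = (acc >= limit) || (acc < -limit);
    /* Assumes >> on a negative value is an arithmetic shift, as on gcc.
       When overflow is set, the narrowed value is not meaningful. */
    out.value = (int32_t)(acc >> FRAC_BITS);
    return out;
}

/* Usage: a row of (1.0, 0, 0) passes vx through and adds trx. */
static void mac_row_demo(void)
{
    MacResult r = mac_row(0x1000, 0, 0, 100, 0, 0, 5);
    assert(r.value == 105 && !r.overflow);
}
```

Keeping the whole sum in one 64-bit integer is what makes the overflow test a simple range comparison, which is the part that is awkward to recover from float results.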

**Exophase** · November 6th, 2006, 03:13

1) Even if VFPU has overflow it's totally different: because it's floating point, the results of these operations will never overflow (since they fall within the overall range of 32-bit float). Instead you'll end up losing a lot of information, but you can at least determine whether it would have overflowed integer-wise with a comparison, as usual (I have no idea how to even test VFPU regs though, you might have to put them back into integer regs...)

Anyway, do you think it might be possible to keep the GTE regs as float and only convert when moving values between them and the CPU? I wonder. Setting all the flags alone probably takes more time than doing the math, especially on VFPU. Dead flag elimination would certainly go a long way since you probably almost never need any of them, however you'd want large blocks or superblock analysis to get anywhere with this. But even saving it within some typical GTE instruction blocks would be a good win. You do have to set as many as 19 flags, and the computation involved for all of that is staggering. If you don't have to set flags then you can do the saturation instructions pretty quickly using the min and max instructions (and perhaps keeping some constants in registers). You can do this for either a VFPU or an integer implementation (of course, since you can do them in parallel on VFPU it'd be even better there).

2) I expect madd to be one cycle, and it's true that it is pretty good compared to what you'd be doing otherwise. The annoying thing is that you have to pull all of those values into registers, although that isn't too bad either. Still, it's dozens of instructions for a matrix multiplication alone. On VFPU it's only one instruction..
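The flag-free saturation from point 1 is just a min followed by a max; setting a flag adds one comparison on top. A minimal sketch in C, with hypothetical helper names and the clamp range and flag bit passed in as parameters rather than taken from any GTE documentation:

```c
#include <assert.h>
#include <stdint.h>

static int32_t min32(int32_t a, int32_t b) { return a < b ? a : b; }
static int32_t max32(int32_t a, int32_t b) { return a > b ? a : b; }

/* Flag-free path: one min and one max per value. */
static int32_t saturate(int32_t v, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    return max32(lo, min32(v, hi));
}

/* Flag-setting path: the same clamp plus a comparison with the input. */
static int32_t saturate_flag(int32_t v, int32_t lo, int32_t hi,
                             uint32_t *flags, uint32_t bit)
{
    int32_t r = saturate(v, lo, hi);
    if (r != v)
        *flags |= bit;   /* the value was clamped, so record it */
    return r;
}
```

For example, `saturate(40000, -32768, 32767)` yields 32767. Every flag that can be proven dead lets the recompiler emit the first form instead of the second.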

**hlide** · November 6th, 2006, 03:18

Originally Posted by Exophase

This is a bit far off, but if you ever get recompiled GTE working you can use liveness analysis to reduce the flag generation. You'd probably eliminate most flags this way since they're not often used. Course, you could also cache the GTE registers in VFPU ones this way.

Liveness analysis? You mean looking ahead at which registers are used before recompiling an atomic block of original instructions, so you can optimise the generated code depending on which registers it uses? Well, that also means you need to generate a sequence of instructions for each GTE instruction instead of calling them. Is it really worth it?

I read somewhere that some games really use those flags, but it would be great if indeed no game really uses them; that would simplify GTE emulation and speed it up a lot for sure.
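For reference, the dead-flag idea boils down to one backward pass over a block. The sketch below is hypothetical (the `Insn` fields and `mark_dead_flags` are invented for illustration) and assumes each flag-writing GTE instruction overwrites the whole FLAG register, so an earlier write is dead unless something reads FLAG in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-instruction summary for one recompiled block. */
typedef struct {
    int writes_flag;  /* instruction overwrites the GTE FLAG register */
    int reads_flag;   /* instruction reads FLAG (e.g. moves it to the CPU) */
    int need_flag;    /* output: must the flag computation be emitted? */
} Insn;

/* Backward pass. live_out says whether FLAG may be read after the block;
   pass 1 when unknown, which is always safe. Returns the number of
   flag-writing instructions whose flag code can be skipped. */
static size_t mark_dead_flags(Insn *block, size_t n, int live_out)
{
    assert(block != NULL || n == 0);
    size_t skipped = 0;
    int live = live_out;
    for (size_t i = n; i-- > 0; ) {
        block[i].need_flag = block[i].writes_flag && live;
        if (block[i].writes_flag) {
            if (!live)
                skipped++;
            live = 0;   /* a full overwrite kills earlier values */
        }
        if (block[i].reads_flag)
            live = 1;   /* a read makes the previous write needed */
    }
    return skipped;
}
```

Games that do read the flags stay correct under this scheme: the write feeding each read is kept, and only writes that nothing can observe are dropped.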

**NoQuarter** · November 6th, 2006, 03:25

I love reading things over my head...

Great job hlide for taking on this project, thank you.
I just started using Exophase's emu and I must say it's phenomenal!
Thank you both for your efforts.
