Emulate display implementation



obo
June 17th, 2004, 13:09
I've done graphics development with DirectX, OpenGL, SDL, Allegro, etc. but am new to DC development. I've spent a couple of days playing with the KOS examples, looking at the best ways of doing various things.

I'm after a fast way to manipulate a 640x480 display for an emulator screen. So far I've seen you can have raw frame-buffer access (too slow?), work with OpenGL/SDL (as thin-ish layers on top of PVR?) or go straight to the PVR hardware. On the PVR side there is a mention of twiddled/non-twiddled textures (which is new to me), and there's a possibility of DMA for texture transfers (would be nice).

So far it's looking like I should render directly to PVR textures, which are then uploaded using DMA, perhaps using twiddled textures for best speed. I can pre-twiddle all my colour data in advance, or perform any other massaging to cut the run-time work.

Am I on the right track, or is there a better way? Short of using ASM, I don't mind getting my hands dirtier if it leaves more CPU time for the rest of the emulation.

Any help or suggestions would be most appreciated :)

BlackAura
June 17th, 2004, 13:29
Depends on what exactly you're trying to display.

Twiddling is a method of scrambling the pixel data in a texture so that it renders faster when you enable texture filtering. It's also known as swizzling. The colour data is the same, but the order of the data is all weird.
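To make the "order of the data is all weird" point concrete, here is a minimal sketch of a twiddled (Morton-order) offset calculation for a square power-of-two texture. The function names are hypothetical, and which coordinate supplies the low bit is an assumption here; check it against the PVR documentation before relying on it.

```c
#include <stdint.h>

/* Spread the low 10 bits of v so that bit i moves to bit 2*i.
 * 10 bits is enough for coordinates up to 1023. */
uint32_t spread_bits(uint32_t v)
{
    uint32_t out = 0;
    for (int i = 0; i < 10; i++)
        out |= ((v >> i) & 1u) << (2 * i);
    return out;
}

/* Texel offset of (x, y) in a twiddled square texture.
 * Assumption: y supplies the even bits and x the odd bits. */
uint32_t twiddle_offset(uint32_t x, uint32_t y)
{
    return spread_bits(y) | (spread_bits(x) << 1);
}
```

With this layout the four texels of any aligned 2x2 block sit next to each other in memory, which is why filtering reads from a twiddled texture are cheaper.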

If your emulator outputs in 16-bit colour (or you can modify it to output in 16-bit colour), rendering to a texture is definitely your best bet. You don't need to twiddle 16-bit textures. The hardware will draw a non-twiddled texture slower if you're using filtering, but for just one polygon you're going to spend more time copying the texture over than it takes the hardware to render it, and the hardware will render it while you're off working on the next frame.

There are two good ways to get the texture over to vram - DMA and store queues.

If your emulator is rendering the display strictly from left to right, top to bottom, and you can break the rendering loop up into 16-pixel blocks (32 bytes of 16-bit pixels, the size of one store queue), store queues will be much, much faster than DMA. A store queue is basically a small buffer inside the CPU which you write the data out to, and then tell it to write to VRAM. It'll then dump the data out to VRAM while you're busy filling up the next store queue with the next set of pixels.

If the rendering code is rendering things all over the place, DMA will be your best bet. Basically, if you've got a large block of data already in main memory that you want to transfer, use DMA. If you're generating a block of data in the correct order, store queues are faster.

If it outputs in 8-bit colour, you'll need to convert it to 16-bit colour, or twiddle it in real time, because 8-bit textures have to be twiddled. Store queues are probably best for this.
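The shape of that conversion loop can be sketched in portable C. This is only an illustration of the block-at-a-time pattern, not real store queue code: `flush_block`, `convert_line` and `BLOCK_PIXELS` are hypothetical names, and the flush is a plain `memcpy` standing in for the hardware write-back so that the sketch runs anywhere.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_PIXELS 16  /* 16 pixels x 2 bytes = 32 bytes per block */

/* Stand-in for the hardware flush. On the Dreamcast this step would be
 * the store queue write-back; here it is an ordinary copy. */
void flush_block(uint16_t *dst, const uint16_t *block)
{
    memcpy(dst, block, BLOCK_PIXELS * sizeof(uint16_t));
}

/* Convert one line of 8-bit pixels to 16-bit through a lookup table,
 * emitting the result in 16-pixel blocks, left to right.
 * Assumption: width is a multiple of BLOCK_PIXELS. */
void convert_line(uint16_t *dst, const uint8_t *src, int width,
                  const uint16_t lut[256])
{
    uint16_t block[BLOCK_PIXELS];

    for (int x = 0; x < width; x += BLOCK_PIXELS) {
        for (int i = 0; i < BLOCK_PIXELS; i++)
            block[i] = lut[src[x + i]];
        flush_block(dst + x, block);
    }
}
```

A 640-pixel line splits into exactly 40 such blocks, so no partial block handling is needed for this display size.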

The ultimate way is to directly use the PVR hardware for all rendering. Depending on the hardware you're emulating, this can be trivial (like really early arcade games, Sega Master System), difficult (like a MegaDrive), or nearly impossible (like anything that doesn't use tile-based displays).

It might help if we knew what kind of hardware this is. Just a clue, like what kind of graphics system it has, or basically how the renderer works.

obo
June 18th, 2004, 14:04
Many thanks for the detailed reply. :)

The emulator builds the display in an internal 8-bit format, with values 0 to 127 corresponding to palette colours on the emulated machine, plus a few more for the emulator GUI. At the end of each frame it determines the changed lines and does a block conversion/update of those lines to the real display.
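A changed-line check of this kind might look roughly like the sketch below. It is not the emulator's actual routine; `find_dirty_lines` and its parameters are hypothetical, and it assumes both frames are tightly packed 8-bit buffers of the same size.

```c
#include <stdint.h>
#include <string.h>

/* Compare two 8-bit frames line by line. Sets dirty[y] to 1 for every
 * line that differs and to 0 otherwise, and returns the number of
 * changed lines. */
int find_dirty_lines(const uint8_t *prev, const uint8_t *cur,
                     int width, int height, uint8_t *dirty)
{
    int changed = 0;

    for (int y = 0; y < height; y++) {
        size_t off = (size_t)y * (size_t)width;
        dirty[y] = memcmp(prev + off, cur + off, (size_t)width) != 0;
        changed += dirty[y];
    }
    return changed;
}
```

Only the lines flagged in `dirty` would then go through the depth conversion and upload.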

On other platforms I support all possible colour depths, just with different conversion code. On palettised 8-bit systems I can usually get away with copying the internal data as-is, as long as a matching native palette can be set up. For the other depths I use a look-up table to fetch pre-built native pixel values (does the limited L2 cache on the SH4 make it worth calculating in real-time instead?!)

The closest I've come to the DC method so far is with OpenGL, which (for maximum PC card compatibility) uses a tiled area of 256x256 textures to cover the overall display area. I update selective horizontal blocks in the textures with glTexSubImage2D - a lot of CPU-involved texture uploading! OS X makes it a little easier by allowing textures to be stored in system RAM, with AGP transfers to pull them across as needed (seemingly without CPU involvement). The DC's fixed hardware should make it much less complicated, and be tailored for a single case.

Assuming I can update sub-portions of an existing texture, it seems like I should use a single 1024x1024 texture, even though only 640x480 of it will be visible. Using a texture over plain VRAM access means I can include the optional stretching feature of the emulator, which corrects for the non-square pixels of the real machine on a TV (about 25% wider).

I'd not come across store queues until you mentioned them, and they do sound ideal for the job! Hopefully I can be doing my depth conversion while the previous queue is being transferred, to keep it fairly lean. The existing internal format the emulator uses makes it hard to avoid some conversion during frame updates. I'd like to avoid changing the common emulator core, though I'd certainly be tempted to use an ASM frame compare routine, depending on how poorly memcmp is implemented in gcc!

The slowest system I've had the emulator running full speed on is a 400MHz XScale (ARM-compatible) Pocket PC, but I'm hoping the 200MHz SH4 in the DC is also up to it. The cycle-accurate C/C++ core is against me in the speed stakes, but I'm hoping a lean-and-mean video module can help offset that. This weekend I'll try getting the core running to give me an idea about speed - frame skipping is a last-resort really.

btw, I do like the fact that you can change the border colour on the DC - ideal for seeing how long things are taking. It reminds me of my old ZX Spectrum days ;)

BlackAura
June 20th, 2004, 00:35
Cool. Let us know how you get on then!

It's a shame that we can't use 8-bit or 4-bit textures without twiddling them though. The twiddling code is an absolute mess, and it's really, really slow. Were it not for that, displaying things in 4- or 8-bit would be so much easier.

What does the frame compare function do? Would I be right in assuming that it compares the previous frame to the current one, and works out what to update? It might be unnecessary on the Dreamcast. Reading things back from main memory is slow, and it might be quicker to just update the entire texture (or at least the part of it that has an image on it).

If your image is 640x480, you can use a 1024x512 texture for it. A 1024x1024x16 texture takes up around 2MB of VRAM, which means that you won't have much memory spare for other things should you ever need them. The Dreamcast can do non-square textures, but it can't do non power-of-two textures. Anything from 8 up to 1024 is valid, as long as it's a power of two.
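The sizing arithmetic above can be checked with a short sketch. The helper names are hypothetical; the 8 to 1024 range and the 16-bit texel size are taken from the paragraph above.

```c
#include <stddef.h>

/* Smallest power-of-two texture dimension that holds n texels, within
 * the 8..1024 range quoted above. Returns 0 if n does not fit. */
unsigned tex_dim(unsigned n)
{
    unsigned d = 8;

    while (d < n && d < 1024)
        d *= 2;
    return (d >= n) ? d : 0;
}

/* Bytes of VRAM used by a w x h texture at 16 bits per texel. */
size_t tex_bytes_16bpp(unsigned w, unsigned h)
{
    return (size_t)w * h * 2;
}
```

For a 640x480 image this gives a 1024x512 texture of 1MB, against 2MB for 1024x1024. The visible area then spans texture coordinates 0 to 640/1024 = 0.625 horizontally and 0 to 480/512 = 0.9375 vertically.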

Oh yeah.... there is no L2 cache on the Dreamcast. The CPU has 8KB of instruction cache, and 16KB of data cache, and that's all. The cache is pretty crappy too - it's a fairly simple direct-mapped cache with a line size of 32 bytes.

The lookup table would be 512 bytes, right? It probably isn't worth trying to calculate it in real time. Most of it's going to be sitting in the cache most of the time, and if your internal buffer is more than 16KB you're going to be wiping the lookup table from the cache a few times. Probably not a big deal. You're still going to have to read the colour palette from somewhere, so you may as well be reading a palette you can use.
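For reference, a table of that size could be built as in the sketch below. It assumes an RGB565 target format and a palette supplied as 8-bit R, G, B triples; both are assumptions for illustration, as are the names `pack_rgb565`, `colour_lut` and `build_lut`.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B components into one 16-bit RGB565 pixel. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* 256 entries x 2 bytes = 512 bytes, indexed by the internal 8-bit value. */
uint16_t colour_lut[256];

/* Fill the first count entries from a palette of R, G, B triples.
 * Entries beyond count are left unchanged. */
void build_lut(const uint8_t palette[][3], int count)
{
    for (int i = 0; i < count && i < 256; i++)
        colour_lut[i] = pack_rgb565(palette[i][0], palette[i][1],
                                    palette[i][2]);
}
```

At 512 bytes the table covers 16 of the 32-byte cache lines, a small slice of the 16KB data cache.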