Emulator Optimization

The topic of how to write an emulator comes up quite often, typically because my emulator is quite slow. Let's set aside the absolute accuracy vs speed hack discussion entirely; that's a somewhat separate matter. This article deals with two approaches that achieve the same end result, bit-for-bit.

As always, I ask that you read my articles as opinion pieces. I am not trying to convince anyone to agree with me -- I'm merely presenting my side of reasoning.

Two approaches: optimize for speed or code readability?

First, we need to clear up some misconceptions. This isn't about superficial optimizations, such as choosing between "x * 2 + y * 2" and "(x + y) * 2". Obviously, the latter is preferable, as it needs one multiplication instead of two. This is about large-scale optimizations that greatly affect the entire structure of an emulator, and its overall maintainability.

That said, there are two general philosophies. The first is to optimize code so that it runs as quickly as possible. This has its advantages, even if a computer is fast enough for unoptimized code: it allows better multitasking, and it also saves battery power on portable devices. Both are very good things.

The second is to forgo optimizations and instead focus on code being easily readable and well-abstracted. This has the obvious advantage of being far easier to read for everyone, as well as being easier to maintain in the future.

My feeling is that the latter is a better choice, and here is why.

Structure

As always, it's best to talk about what someone knows. And what I know is SNES emulation, so let's use that as our example.

The SNES has two general purpose processors: the S-CPU for general programming and video control, and the S-SMP for audio control. The S-CPU is a very complex processor, at least for its time. On top of the actual processor core, it features a DMA controller that can even schedule small DMA batch transfers during each scanline as the screen renders, an interrupt unit that allows both NMIs and programmable IRQs to trigger based upon the raster position of the video processor, and a hardware math unit. The S-SMP, by comparison, is a relatively simple processor. It lacks DMA, interrupts and any sort of math unit. Further, the two processors very closely follow the "black-box" paradigm. That is, they do not share an address bus, nor any memory. In fact, they can only communicate via four 8-bit register ports.

This makes the S-SMP an ideal candidate for enslavement. This is where you allow a primary processor, in this case the S-CPU, to control the secondary processor, in this case the S-SMP. How this works is that the emulation core treats the system as though it only has one processor: the S-CPU. When the S-CPU performs a time-consuming operation, e.g. a memory access or an internal operation cycle, the S-CPU will keep track of how far "behind" in time the S-SMP is. If needed, it will invoke the S-SMP to run for a small amount of time and then return.
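To make that bookkeeping concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration: the names (EnslavedSMP, PrimaryCPU, add_clocks, sync), the threshold, and the simplification that both processors count time in the same clock units. It is not taken from any real emulator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical secondary processor: it only runs when the primary asks it to.
struct EnslavedSMP {
  uint64_t clocks_run = 0;   // total time emulated so far
  unsigned invocations = 0;  // how often the primary called in

  void run(uint64_t clocks) {
    // a real core would execute instructions here until `clocks` are consumed
    clocks_run += clocks;
    invocations++;
  }
};

// Hypothetical primary processor: note that it has to know about the S-SMP.
struct PrimaryCPU {
  EnslavedSMP& smp;
  uint64_t threshold;       // largest lag tolerated before catching up
  uint64_t smp_behind = 0;  // how far "behind" in time the S-SMP is

  PrimaryCPU(EnslavedSMP& smp_, uint64_t threshold_)
  : smp(smp_), threshold(threshold_) {
    assert(threshold > 0);
  }

  // called after every memory access or internal operation cycle
  void add_clocks(uint64_t clocks) {
    smp_behind += clocks;
    if(smp_behind >= threshold) sync();
  }

  // run the S-SMP until it has caught up, then return to the S-CPU
  void sync() {
    if(smp_behind == 0) return;
    smp.run(smp_behind);
    smp_behind = 0;
  }
};
```

In a sketch like this, the S-CPU would also call sync() right before touching one of the four shared ports, so that the S-SMP is fully caught up whenever the two actually exchange data.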

Enslavement is a great example of an optimization at the expense of maintainability. To see why this approach is faster, let's look at how an emulator without enslavement would work.

With an emulator, you only have one process with which to run your entire application. This forces you to implement each emulated processor core as a state machine. Take a DMA transfer for example: you want to move a large block of memory from one address to another. But you can't simply move the entire block at once, or too much time will pass. Your emulator will be unresponsive, video will not get updated, sound samples will not be generated, etc. So instead, you have to break the transfer up. Each time you ask the S-CPU to run, it will transfer one byte and then return control. The same goes for the S-SMP, which executes a single instruction and then returns. The more accurate you want the emulator to be, the less time you can allow a processor to execute before it has to return control for other processors to run. What happens is that eventually you spend most of your time simply running the state machine to remember where you are within a process, and very little time actually emulating the system.
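As a rough illustration of the DMA case, the sketch below keeps the progress of a transfer in member variables so that each call can move one byte and return. The names (DmaChannel, step, run_transfer) and the flat memory vector are hypothetical, chosen only to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical DMA channel written as a state machine: all progress lives in
// member variables, because every call must be able to resume where the last
// one stopped.
struct DmaChannel {
  std::size_t source = 0;
  std::size_t target = 0;
  std::size_t remaining = 0;

  bool active() const { return remaining > 0; }

  void start(std::size_t source_, std::size_t target_, std::size_t length) {
    source = source_;
    target = target_;
    remaining = length;
  }

  // transfers exactly one byte, then returns control to the caller
  void step(std::vector<uint8_t>& memory) {
    if(!active()) return;
    assert(source < memory.size() && target < memory.size());
    memory[target++] = memory[source++];
    remaining--;
  }
};

// Hypothetical emulation core loop; returns how often control came back to it.
unsigned run_transfer(DmaChannel& dma, std::vector<uint8_t>& memory) {
  unsigned returns_to_core = 0;
  while(dma.active()) {
    dma.step(memory);
    returns_to_core++;  // the other processors, video and audio would run here
  }
  return returns_to_core;
}
```

A four-byte transfer returns to the core four times. Scale that up to every processor in the system and you can see where the time goes.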

Enslavement helps this by allowing the S-CPU to complete much larger operations. Since there are only two general processors, the S-CPU can transfer a much larger chunk of memory, and with each byte, it can invoke the S-SMP as needed. This allows the same degree of accuracy, and allows the more complex S-CPU to execute much more before needing to return control back to the emulation core.
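For contrast with the byte-at-a-time state machine, here is a sketch of the same kind of transfer under enslavement. The whole block is moved in one call, and the S-SMP is invoked from inside the loop whenever it has fallen far enough behind. The names (EnslavingCPU, run_smp, dma_transfer), the threshold, and the cost of eight clocks per byte are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical S-CPU that drives the S-SMP itself.
struct EnslavingCPU {
  std::function<void(uint64_t)> run_smp;  // stand-in for the S-SMP core
  uint64_t smp_behind = 0;                // lag accumulated so far
  uint64_t threshold = 32;                // assumed catch-up threshold

  void add_clocks(uint64_t clocks) {
    smp_behind += clocks;
    if(smp_behind >= threshold) {
      run_smp(smp_behind);  // the S-SMP runs from inside the S-CPU
      smp_behind = 0;
    }
  }

  // the whole transfer completes in one call: the loop counter is an ordinary
  // local variable, and nothing has to be saved between bytes
  void dma_transfer(std::vector<uint8_t>& memory,
                    std::size_t source, std::size_t target, std::size_t length) {
    assert(run_smp);
    assert(source + length <= memory.size() && target + length <= memory.size());
    for(std::size_t n = 0; n < length; n++) {
      memory[target + n] = memory[source + n];
      add_clocks(8);  // assumed cost per byte, for illustration only
    }
  }
};
```

With these numbers, a 16-byte transfer invokes the S-SMP four times but never returns to the emulation core in between.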

Similarly, the S-PPU video processors can be enslaved to the S-CPU, and the S-DSP audio processor can be enslaved to the S-SMP.

While at first this may seem like a win-win situation, let's look at the downsides. On a real system, the S-CPU has no knowledge that there is an S-SMP. Seriously. They only share 32 bits of data between each other. The S-SMP could easily be replaced by another processor entirely, and the S-CPU would be none the wiser. This is simply how the hardware is structured. By adding enslavement, we are deviating from the design of the hardware.

That alone may not be very convincing, but let's imagine we want to adjust the S-SMP timing. In order to do that, one would have to tweak the synchronization within the S-CPU processor. Would this make sense to you if you were studying a SNES emulator? Would the code be self-documenting if it didn't work at all like an actual SNES? Would you think to check the S-CPU to adjust the speed at which the S-SMP ran? Okay, so why adjust the S-SMP speed at all? Well, it's well known that the S-SMP's clock tends to vary quite a bit. On real hardware, it's been observed to run between ~24.576MHz and ~24.607MHz. It is not unreasonable to allow the S-SMP clock rate to be modified.

... and you can do that with enslavement, too. You just have to add the logic to accept a dynamic clock inside the S-CPU. You can even hide the S-SMP clock rate in the S-SMP class. But what if you never added this? What if you wanted to add this later on? You can't always account for the future. Suddenly, it would seem much easier to do if the S-CPU and S-SMP had run independently of each other from the start.

And this boils down to the crux of the argument: writing code for readability is not so much about what we already know, but about preparing for what we don't know. What if there is an as yet unknown issue that would complicate this enslavement model? Here's a good example: our limited understanding of the S-SMP CONTROL register, which allows software control of the S-SMP's execution rate. No SNES emulator currently accounts for this. In which emulator would you find it easier to implement such a new feature? One that hides its synchronization inside what should be an unrelated processor, or one which controls all synchronization in a centralized place, outside of individual processor cores?
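To show what "a centralized place" could look like, here is a minimal sketch in which each processor only knows its own frequency and its own elapsed time, and a separate scheduler always runs whichever processor is furthest behind. The names (Processor, Scheduler, run_next), the picosecond time base and the step granularity are my own assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// time is counted in picoseconds, so that processors with different clock
// rates can be compared directly (an assumption made for this sketch)
const uint64_t kPicosecondsPerSecond = 1000000000000ULL;

// Hypothetical processor: it knows nothing about any other processor.
struct Processor {
  std::string name;
  uint64_t frequency;  // in Hz; may be changed at any time
  uint64_t time = 0;   // how far this processor has run, in picoseconds
  uint64_t steps = 0;  // how often the scheduler has run it

  Processor(std::string name_, uint64_t frequency_)
  : name(std::move(name_)), frequency(frequency_) {
    assert(frequency > 0);
  }

  void step(uint64_t cycles) {
    // a real core would emulate `cycles` worth of work here
    time += cycles * (kPicosecondsPerSecond / frequency);
    steps++;
  }
};

// All synchronization lives here, outside of the processor cores.
struct Scheduler {
  std::vector<Processor> processors;

  // runs the processor that is furthest behind in time for `cycles` cycles
  Processor& run_next(uint64_t cycles) {
    assert(!processors.empty());
    Processor* next = &processors[0];
    for(auto& p : processors) {
      if(p.time < next->time) next = &p;
    }
    next->step(cycles);
    return *next;
  }
};
```

Changing the S-SMP's speed in a design like this means changing one field on the S-SMP's own entry. The S-CPU code never needs to hear about it.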

Or what if you were just reading the S-CPU source code to see how it works? Would you want that code cluttered with unrelated S-SMP references? And what if you just wanted to use the S-CPU in isolation? Perhaps as a basis for an Apple IIGS emulator? Well, you'd of course have to decouple it from the S-SMP, which the Apple computer does not use.

Or something even more radical, and a real-world example: what if you wanted to replace the entire processor core with another one? Say, from another emulator, or one of your own design? Now you have to update the new processor core to also enslave your S-SMP. I have done this, twice in fact. The first time was to replace the opcode-based processor cores with cycle-based processor cores, and the second time was to refine those into bus-level processor cores implemented using cothreads -- more on that in a bit. Had I used enslavement, this task would have been much more difficult.

Combine that with another optimization that doesn't by itself decrease accuracy, such as, say, writing your processor core in pure assembler, and suddenly replacing a processor core goes from a weekend job to something requiring several months, or maybe even years, of effort. See the ZSNES v2.0 rewrite as an example of this.

Now, back to that previous thought: cothreads, or cooperative threads. By not using the enslavement model, I was able to utilize these. They are definitely slower than state machines, as process context switches are very painful for today's heavily pipelined and predictive processors. But they allow for some amazing advantages to code readability.

Remember how I was saying that the S-CPU could reduce the complexity of its state machine (but not the S-SMP)? Well, with cothreads, we can eliminate the state machine entirely! It can instead be replaced by a single yield() call inside each processor's timing unit.
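Here is a small sketch of that idea. It assumes POSIX ucontext (getcontext, makecontext, swapcontext) as the cothread mechanism purely because it is widely available on Linux; this article does not depend on any particular cothread library, and the names (bus_access, instruction, run_cpu_for) are hypothetical.

```cpp
#include <ucontext.h>
#include <cassert>
#include <vector>

namespace {

ucontext_t main_context;   // the emulation core
ucontext_t cpu_context;    // the processor's own cothread
std::vector<int> bus_log;  // records each simulated bus access

// the only place where the processor gives up control
void yield() { swapcontext(&cpu_context, &main_context); }

// hypothetical timing unit: every bus access ends with a yield
void bus_access(int value) {
  bus_log.push_back(value);
  yield();
}

// a hypothetical three-cycle instruction, written as straight-line code;
// the cothread's stack remembers where we are, so no switch() is needed
void instruction() {
  bus_access(1);
  bus_access(2);
  bus_access(3);
}

// the processor believes it is the entire program
void cpu_entry() {
  for(;;) instruction();
}

}  // namespace

// Hypothetical driver: lets the processor perform `accesses` bus accesses,
// regaining control after each one, and returns what it did.
std::vector<int> run_cpu_for(int accesses) {
  assert(accesses >= 0);
  static std::vector<char> stack(64 * 1024);  // stack for the cothread
  getcontext(&cpu_context);
  cpu_context.uc_stack.ss_sp = stack.data();
  cpu_context.uc_stack.ss_size = stack.size();
  cpu_context.uc_link = &main_context;
  makecontext(&cpu_context, cpu_entry, 0);
  bus_log.clear();
  for(int n = 0; n < accesses; n++) {
    swapcontext(&main_context, &cpu_context);  // resume the processor
  }
  return bus_log;
}
```

Note that stopping after five accesses leaves the processor parked in the middle of its second instruction, and the next switch would resume it exactly there, without a single state variable in the processor code.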

So now, we have two entirely separate processors, implemented in two entirely separate classes that have no knowledge of each other. And best of all, there's not even a state machine involved. To the casual observer, the source code reads like said processor was the entire program! Words really cannot describe what an advantage this is, even if it is slower than enslavement. Perhaps a code example will help.

//optimized state machine version, for use with processor enslavement
void CPU::asl_addr() {
  switch(cpu.opcode_cycle++) { default:
    case 0:
      aa = op_readpc();
      break;
    case 1:
      aa |= op_readpc() << 8;
      break;
    case 2:
      rd = op_readdbr(aa.w++);
      break;
    case 3:
      rd |= op_readdbr(aa.w) << 8;
      break;
    case 4:
      rd <<= 1;
      regs.p.nz = rd; //n, z flags decoded only when needed
      break;
    case 5:
      op_writedbr(aa.w--, rd >> 8);
      break;
    case 6:
      op_writedbr(aa.w, rd);
      cpu.opcode_cycle = 0;
      break;
  }
}

//readable cothreaded version, for use as an independent processor
//note, .l (low-byte), .h (high-byte) unions on .w (16-bit word) used for clarity
void CPU::asl_addr() {
  aa.l = op_readpc();
  aa.h = op_readpc();
  rd.l = op_readdbr(aa.w + 0);
  rd.h = op_readdbr(aa.w + 1);
  rd <<= 1;
  regs.p.n = (rd & 0x8000);
  regs.p.z = (rd == 0);
  op_writedbr(aa.w + 1, rd.h);
  op_writedbr(aa.w + 0, rd.l);
}

... the code really does speak for itself. Which would you rather try and locate a bug in?

In all honesty, being just one person, achieving the compatibility I have in such a short time really does speak for itself: it is something that multiple groups of independent developers optimizing for speed could not do in a decade. I truly believe that focusing on optimizations would have severely hampered my ability to achieve what I have. Note that I only bring this up to prove this point, and not to brag: even my emulator is a miserable failure in my eyes when it comes to properly recreating the original hardware, and I have a long way to go yet. Let's consider optimizations when and only when I reach my goal of near-perfectly recreating the original hardware, yes?

Relevance in Time

Another very important thing that is often not considered is the relevance of such aggressive optimizations in the future. Imagine for a moment a world where Nintendulator and NESticle came out at the same time. Where the fastest computer around was a 166MHz Pentium. Can you imagine anyone saying any good things about Nintendulator, which requires at least a 1GHz CPU? Now how about today, when even $300 sub-notebooks can run Nintendulator at full speed? The result is that NESticle is the one most often ridiculed. Of course, that's more a debate of accuracy vs speed hacks ... substitute Nintendulator with FCE Ultra or something if you prefer. The point I am trying to make is that worrying so much about speed today tends to look pretty silly in the future.

For me, I think about the long-term. I'm not worried about computers today. I'm worried about the day when a real SNES unit sells for several thousands of dollars, and is completely out of reach of the casual gamer. When it will no longer be possible to run hardware tests to improve emulation. I want to do everything I can so that we are as prepared as possible for that day. With so many absolutely fantastic SNES emulators already around that run at full speed on modern computers, it seems silly for me to compete with them when I can do something new, something I personally find much more valuable in the long run: preserve history.

Surely, the system requirements of bsnes won't seem so extreme in ten years when even low-end cellphones can run it at full speed, right?

In Closing

This isn't to say I completely ignore optimizations: I do implement some anyway, such as my PPU tile caching method. It's also not to say that I'm great with optimizations. Bottom line is that I'm not perfect, and my emulator is a work in progress.

Again, I'm not trying to argue that everyone should abandon speed. In fact, there's an enormous amount of advantage in having emulators that can run at full speed on modern computers. In my case, these already exist. But that isn't the case for all systems. What I'm saying is that there's room for both approaches! Neither is necessarily "right" or "wrong"; there is no black and white, only various shades of gray. It would seem highly prudent for them all to coexist, would it not?

Lastly, I don't mean to silence those giving me feedback that my emulator is too slow: in the paraphrased words of the late Randy Pausch, people complaining about you and your work is a good thing, indeed! It shows they care about it! It's when they stop complaining that you need to start worrying, for that is when they have given up all hope on you.