The Super Nintendo Entertainment System was, in many ways, the most advanced game console of its era, featuring 128kiB of work RAM, tilemap scaling and rotation, a massive color palette of 15-bit colors, translucency, flexible graphics modes, DMA, HDMA, and many other acronyms. But SNES gamers also know that it had an Achilles’ heel: speed. The pokey S-CPU wasn’t always the fastest kid on the block, and with SlowROM, it can be even slower than that. Let’s take a look at one particular game that runs slowly, Fatal Fury, and see if we can speed it up a tad.

Fatality

To give this story some “human interest”, for the past few weeks, I’ve seen this sitting on my dresser. A copy of Garou Densetsu for the Super Nintendo. I found it at a thrift store alongside some jewelry; because I wanted the jewelry on my dresser, I never bothered to move the game.

Cartridge sitting on the dresser

The Super Nintendo port of Fatal Fury, as it’s known in the west, was a fairly successful game as far as I can tell, but also felt like an also-ran compared to the juggernaut that is Street Fighter II. Getting rid of many of Fatal Fury’s unique elements like the 2-on-1 battles, the arm wrestling, and two-line combat, the Super Nintendo version moved closer to its Capcom counterpart, though it still maintained the SNK game’s greater focus on story, along with its limited number of characters.

Fatal Fury title screen on SNES

Still, I’m a big fan of SNK’s works, and the SNES port is still pretty solid; I imagine it was a pretty fun rental back in the 90’s. But there’s one thing about the game that’s always bothered me. Before a match, you see a screen like this:

The text WAITING on a blue background

A loading screen! Remember, the Super Nintendo never got a CD attachment; this is a ROM chip hooked up directly to the Super Nintendo. So why does it have this loading screen? And what can we do about it?

Tracing it out

First off, what’s going on during the waiting screen? For this we turn to a technique we also used back in the ALF saga: trace logging. The Super Nintendo uses a WDC 65C816 CPU, a derivative of the famous MOS 6502, but other than that the concept is the same. As a single-threaded CPU, it runs one instruction at a time; we log that instruction.

Mind you, trace logs are huge. A log just through the waiting period is a massive 272MiB.

I expected most of the time to be spent copying to VRAM (writing to $2118; this page is a must read), and while that did happen a lot, the vast majority of the time is spent on this loop:

00871a  ldx #$0000                       A:0055 X:0004 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC  V:117 H: 682 I:0
00871d  ror $17                [000017]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC  V:117 H: 706 I:0
00871f  ror $16                [000016]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc  V:117 H: 744 I:0
008721  ror $15                [000015]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZc  V:117 H: 782 I:0
008723  ror $14                [000014]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZc  V:117 H: 820 I:0
008725  ror $13                [000013]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC  V:117 H: 858 I:0
008727  ror $12                [000012]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC  V:117 H: 896 I:0
008729  ror $11                [000011]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC  V:117 H: 934 I:0
00872b  ror $10                [000010]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC  V:117 H: 972 I:0
00872d  bcs $8738              [008738]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc  V:117 H:1010 I:0
00872f  stz $0c20,x            [000c20]  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc  V:117 H:1026 I:0
008732  inx                              A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc  V:117 H:1064 I:0
008733  dec $0d                [00000d]  A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc  V:117 H:1078 I:0
008735  bne $871d              [00871d]  A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc  V:117 H:1116 I:0
00871d  ror $17                [000017]  A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc  V:117 H:1138 I:0

It’s a bit more complicated than this, but what it’s doing over and over again is using the ROR opcode on the eight bytes from $00:0017 down to $00:0010. These are in RAM; the memory map of the SNES is a bit complicated. To make a long story short, the CPU still bears a lot of resemblance to the 65c02, with its 16-bit addresses, but it can access more memory than that. Special registers determine the “page” (what 65c816 documentation usually calls the bank), which is the top byte of a 24-bit address. The SNES maps things into memory differently depending on which page is being accessed.

In what’s good news for Nicole Express, who doesn’t know the 65c816 very well, the M flag is set, so this is running in 8-bit mode. Specifically this is an 8-bit rotate, using the carry to spread across eight bytes, and then branching to $8738 if carry is set. And what’s there? That’s another loop condition, increasing y and x, and storing data in $0c20,x, an address in RAM.

008738  lda [$00],y            [2cfdd7]  A:0055 X:0008 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC  V:120 H: 476 I:0
00873a  sta $0c20,x            [000c28]  A:0003 X:0008 Y:0009 S:01f2 D:0000 B:00 nvMxdIzC  V:120 H: 524 I:0
00873d  iny                              A:0003 X:0008 Y:0009 S:01f2 D:0000 B:00 nvMxdIzC  V:120 H: 602 I:0
00873e  inx                              A:0003 X:0008 Y:000a S:01f2 D:0000 B:00 nvMxdIzC  V:120 H: 616 I:0
00873f  dec $0d                [00000d]  A:0003 X:0009 Y:000a S:01f2 D:0000 B:00 nvMxdIzC  V:120 H: 630 I:0
008741  bne $871d              [00871d]  A:0003 X:0009 Y:000a S:01f2 D:0000 B:00 nvMxdIzC  V:120 H: 668 I:0

My best guess is that this is some kind of decompression building up data starting in 00:0c20. These decompressed blocks are moved into the 128kiB of WRAM that the SNES has in pages $7f and $7e, and then are transferred to VRAM using the Super Nintendo’s Direct Memory Access, which you can think of as being similar to the special transfer instructions on the PC Engine’s processor.
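To make that guess a little more concrete, here is a rough Python model of what the loop in the trace appears to do. It's a sketch under assumptions, not the game's confirmed format: I'm assuming the eight bytes at $10 to $17 act as a 64-bit mask consumed one bit per output byte (bit 0 of $10 first), that a clear bit stores a zero, and that a set bit copies the next byte from the source. The names `expand_zero_mask`, `mask_bytes` and `literals` are made up for illustration.

```python
def expand_zero_mask(mask_bytes, literals, count):
    """Model of the traced loop (an assumption, not a confirmed format).

    mask_bytes: the 8 bytes at $10..$17, lowest address first.
    literals:   the source bytes that LDA [$00],Y would read, in order.
    count:      the loop counter kept at $0D (at most 64 in this model).
    Returns the output bytes and how many source bytes were consumed.
    """
    # ROR $17 ... ROR $10 shifts the whole 64-bit value right by one bit,
    # so the bit that falls into carry is bit 0 of $10.
    mask = int.from_bytes(mask_bytes, "little")
    out = bytearray()
    src = 0
    for _ in range(count):
        carry = mask & 1
        mask >>= 1
        if carry:
            out.append(literals[src])  # LDA [$00],Y / STA $0C20,X / INY
            src += 1
        else:
            out.append(0)              # STZ $0C20,X
    return bytes(out), src
```

With a mask whose first byte is %00000101 and the source bytes $AA $BB, four iterations of this model would write $AA $00 $BB $00. If the real format works like this, runs of zero bytes (common in tile graphics) cost only one bit each.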

So it’s pretty much what you might have guessed. Fighting games require a lot of graphics for large, detailed characters, and corporations want to sell smaller ROMs, so compression is unsurprising. Now, why is Fatal Fury so much slower than, say, Street Fighter II? Even the weird lock-on technology version I have?

A Super Famicom Street Fighter 2 game with another game plugged into it

Eh, who knows. The game might just be less efficiently coded, or the compression is more aggressive– Fatal Fury is a smaller ROM than Street Fighter II. Programming is all about tradeoffs.

Easy fixes?

Well, the easiest fix is to run Fatal Fury on the Neo Geo.

Fatal Fury 1 Neo Geo, with the box art modified to have the SNES cartridge being punched off the building

Okay, okay, that was rude of me. But, why is Fatal Fury faster on the Neo Geo?

  • No decompression needed; graphics are in ROMs right on the video bus, like the NES games that used CHR-ROM.
  • Writes to video RAM can be done at any time. The SNES has to wait for blanking periods.
  • If it needs to do heavy data processing, the Neo Geo has a 12MHz Motorola 68000 CPU, while the SNES has a… how fast is the S-CPU anyway?

A quick look at Wikipedia will tell you that the S-CPU, a Ricoh 5A22 based on a WDC 65c816 core, runs at 3.58MHz. But that’s not quite true. Much like your internet speeds, the Ricoh 5A22 runs at up to 3.58MHz. It can also run as slow as 1.8MHz! This is a variable-speed bus, which is actually an incredibly common technique. The speed depends on the address being read.

Specifically, the areas are:

  • 3.58MHz: Hardware registers, fast ROM
  • 2.68MHz: Slow ROM, work RAM
  • 1.79MHz: Accessing certain peripherals (the controller)

I have to admit, while the most reliable sources, including the source code for the well-regarded higan emulator, all say the speed drops for the RAM, I’m a bit surprised; usually, it’s easier to get faster-running RAM. I’m also not sure why the controller is so much slower; perhaps the original backwards-compatibility plans with the NES required a slow controller port that could stay in line with the shift registers in early NES controllers? I’m just speculating here.

What we actually care about here is the difference between slow ROM and fast ROM. See, this depends on what’s in the cartridge. (Of course, what’s in the cartridge was determined by Nintendo, who manufactured all licensed SNES games. And they charged dearly for the difference. Given how NEC/Hudson were selling HuCards to match with a 7.1MHz 65C02 CPU, I have to wonder if they were charging too much, but this isn’t a review of Nintendo’s business practices.) In any case, it’s no surprise that Fatal Fury used slow ROM, which means that the CPU is very rarely actually running at that 3.58MHz.

So, would the 33% increase in CPU bus speed (only when accessing ROM) actually matter for us? Well, there’s no way to find out but to do it.
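Where do those three speeds and the 33% come from? As far as I understand it, the CPU divides a master clock of roughly 21.477MHz (on NTSC consoles) by 6, 8 or 12 depending on the address being accessed. A quick Python sketch of the arithmetic, with the clock value and the names `MASTER_CLOCK_HZ` and `access_rate_mhz` being my own assumptions rather than anything from the game:

```python
MASTER_CLOCK_HZ = 21_477_272  # NTSC master clock (assumed; PAL consoles differ)

# Master clock cycles consumed by one memory access in each speed class.
MASTER_CYCLES_PER_ACCESS = {"fast": 6, "slow": 8, "xslow": 12}


def access_rate_mhz(kind):
    """Effective access rate in MHz for one of the three speed classes."""
    return MASTER_CLOCK_HZ / MASTER_CYCLES_PER_ACCESS[kind] / 1_000_000


def rom_speedup():
    """How much faster a fast ROM access is than a slow ROM access."""
    return MASTER_CYCLES_PER_ACCESS["slow"] / MASTER_CYCLES_PER_ACCESS["fast"]


for kind in ("fast", "slow", "xslow"):
    print(kind, round(access_rate_mhz(kind), 2), "MHz")
print("ROM speedup:", round((rom_speedup() - 1) * 100), "%")
```

Eight cycles down to six is a ratio of 4/3, which is where the 33% figure comes from. It only applies to the accesses that actually hit ROM.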

Converting a game to use FastROM isn’t my idea, it’s been done quite a few times. Most famously, Vitor Vilela, known for his work converting games to use the SA-1 coprocessor, has converted several games to fast ROM, including Super Castlevania IV, seeing impressive results reducing slowdown. Check that out for a picture of what proper patch code might look like, and check out kandowontu for even more FastROM games. Keep reading this blog post for a careful explanation of a hackjob.

Converting to FastROM

So, you want to convert a slow ROM game to fast ROM? It’s 2023, after all, and if your flash cart can run Donkey Kong Country (legitimately dumped from a cartridge you personally own, of course) it can keep up with that blazing 3.58MHz clock. You’d hope it’d just be a matter of going to the ROM header and changing a bit, but it’s not that easy. The ROM header on the Super Nintendo doesn’t actually do anything– it was a Nintendo requirement, and emulators may use it to get their bearings, but the console itself never reads it. (Unless the game tells it to for some reason, but then it’s just reading it like any other data in the ROM)

So you look a little deeper, and see the register $420d. This does determine the speed of the bus when accessing ROM, so we’re definitely a bit closer. But there’s a catch: it only applies to pages $80-$ff. These are “mirrors” of the ROM pages in $00-$7d ($7e and $7f are RAM); the data comes from the same place in ROM. But if you access them using their lower addresses, you’ll get the slower bus speed no matter what’s been sent to $420d. And pretty much every slow ROM game uses the low-numbered pages, including Fatal Fury.
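Here is a small Python sketch of those rules as I understand them from the usual memory-map documentation: given a 24-bit address and whether $01 has been written to $420d, it returns how many master clock cycles one access costs (6 is the fast 3.58MHz case, 8 is 2.68MHz, 12 is 1.79MHz). The function name `access_cycles` and the exact range boundaries are my assumptions; treat it as an illustration, not a reference.

```python
def access_cycles(addr, fastrom_enabled):
    """Master clock cycles for one CPU access to a 24-bit address (sketch)."""
    bank = (addr >> 16) & 0xFF
    offset = addr & 0xFFFF
    upper_half = bank >= 0x80      # only banks $80-$FF can ever be fast ROM
    low_bank = bank & 0x7F

    if low_bank >= 0x40 or offset >= 0x8000:
        # Cartridge ROM area (and WRAM in $7E/$7F, which is always slow).
        return 6 if (upper_half and fastrom_enabled) else 8
    if offset < 0x2000 or offset >= 0x6000:
        return 8                   # low RAM mirror and expansion area
    if 0x4000 <= offset < 0x4200:
        return 12                  # old-style controller ports
    return 6                       # PPU and CPU hardware registers


print(access_cycles(0x008000, True))   # bank $00: slow regardless of $420d
print(access_cycles(0x808000, True))   # bank $80 mirror: fast once enabled
```

The first call shows the catch described above: code running from bank $00 pays the slow price even with the register set.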

So what do you need to do to convert to fast ROM?

  1. Update the header (it’s just polite)
  2. Write $01 to $420d
  3. Make every single ROM access do so through the higher-numbered pages: both the data bank, and code.

That is to say, to move from slow ROM to fast ROM, you actually need to change the code. It’s time to buckle up and do some disassembly.

Disassembly on the Super Nintendo

For disassembly, I used DiztinGUIsh, a disassembler designed to go with a special build of the bsnes-plus debugger-equipped emulator. See, disassembling a Super Nintendo game is a little more complicated than, say, the Z80 code needed to port a game from the Game Gear to the Master System. And it’s all because of the 65c816 CPU. But why does it need to be linked with the emulator?

You might be aware that even today, Intel and AMD x86-64 CPUs boot into what’s called “real mode”, a 16-bit mode that’s compatible with code dating back to the 1978 Intel 8086. (Well, er, close enough for our purposes, anyway) The 65c816 is very similar; it was intended to boot into an 8-bit mode compatible with the classic MOS 6502. That’s what allowed the Apple IIgs to run original Apple ][ software, and over here in the video game console world, allowed Super Mario All-Stars to share a lot of code with the NES releases.

On the Z80 or 6502 or 8086, a lot of the difficulty of disassembling is distinguishing code from data, and lining up where the code starts. Instructions on these processors are variable-length; a 6502 NOP is a single byte, while a jump to a fixed location, JMP $AABB, is three bytes; so you need to know where to start reading code from. Disassembling data, on the other hand, will just give you irrelevant nonsense. (It’s always possible for assembly code to do all sorts of fancy nonsense complexifying this nice picture, but it’s good enough for most code)

All of this is also true on the 65c816. But there’s an added complexity. An instruction’s length can also change depending on whether the processor is in 16-bit mode or 8-bit mode. And the mode could be changed at any point in the program; it’s a global state. And unlike your average x86-64 program, 65c816 programs change mode all the time. Even knowing a certain block beginning at a certain address is code isn’t enough; you need to know what mode it’s expecting to run in.
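A toy Python example of the problem. It knows only a handful of opcodes (the table is deliberately tiny, and the names `instruction_length` and `split_instructions` are mine), but it shows the same six bytes splitting into different instructions depending on whether the accumulator is assumed to be 8-bit or 16-bit:

```python
# Total instruction length in bytes for a few opcodes whose size never changes.
FIXED_LENGTH = {
    0xEA: 1,  # NOP
    0x8D: 3,  # STA absolute
    0x0D: 3,  # ORA absolute
    0x5C: 4,  # JML long
}
M_DEPENDENT = {0xA9}  # LDA #immediate: size follows the accumulator width
X_DEPENDENT = {0xA2}  # LDX #immediate: size follows the index width


def instruction_length(opcode, m8, x8):
    """Length of one instruction given the current register widths."""
    if opcode in M_DEPENDENT:
        return 2 if m8 else 3
    if opcode in X_DEPENDENT:
        return 2 if x8 else 3
    return FIXED_LENGTH[opcode]


def split_instructions(code, m8=True, x8=True):
    """Cut a byte string into instructions, returned as hex strings."""
    pos = 0
    out = []
    while pos < len(code):
        length = instruction_length(code[pos], m8, x8)
        out.append(code[pos:pos + length].hex())
        pos += length
    return out


code = bytes.fromhex("a9018d0d42ea")
print(split_instructions(code, m8=True))   # LDA #$01 / STA $420D / NOP
print(split_instructions(code, m8=False))  # LDA #$8D01 / ORA $EA42
```

Guess the mode wrong and the disassembler happily swallows the next opcode as part of an immediate value, and everything after it is misaligned.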

A way around this is to link the disassembler with an emulator playing the game in real time; that lets the disassembler know what state the processor is in, and how to proceed. The downside is, you actually need to play the game. This is basically the same thing as the trace log above. I’ve never done this before, but the tutorial on DiztinGUIsh’s GitHub seemed pretty straightforward.

We don’t actually need to play that much of the game. This is mostly a demonstration piece, so I really only need to play through the loading part. But this is good to know if you want to do this yourself, and do a good job: you’ll need to be able to play through as much of the game as you can, to trigger as many weird code cases as possible.

Rewriting to use FastROM

Changing the code bank

One thing I found in my trace is that all the code in Fatal Fury that I found is in bank 00. This is great, because it means that once we start executing code, we don’t need to worry about changing banks. But remember, we’re in bank 00; this means we’re in the slow part of the memory map. And how did we get to bank 00? We need to learn a bit more about the 65c816.

The 65c816 always boots up in 8-bit mode. Specifically, exactly like the 6502 with which it is backwards compatible, it boots up and reads the “reset vector”, a two-byte address located at address 00:FFFC. It then jumps to that two-byte address; since it’s a two-byte address, it can’t change the bank, but that’s not a huge deal. What we need is a trampoline function: something that the reset vector points to, and which then jumps to the fast ROM mirror of the original entry point (address 8000), located in bank 80.

And, when we’re writing our trampoline, we might as well also write $01 to that register at 420d, and get that out of the way.

I found a row of zeroes (which hopefully is unused) in bank 00 just above 00:FD6A, and wrote the following:

RESET_VECTOR: LDA.B #$01
              STA.W $420D
              JML $808000

The .B and .W suffixes give the size of the operand. LDA.B loads a one-byte immediate, which is the correct one to use since we’re in 8-bit mode when the processor starts up; STA.W takes a two-byte address, which $420D needs (as far as I can tell, .B would ask for a one-byte address instead). JML is the 24-bit version of the jump instruction; you might sometimes see it as JMP.L, but the Asar assembler DiztinGUIsh uses wants it this way. Notice that even if we’re in 8-bit mode, we can still do 65c816-specific things.
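For a sense of how much free space the trampoline needs, here is a Python sketch that assembles it by hand, treating the store to $420D as a two-byte absolute address. The opcode bytes are the standard 65c816 encodings; the helper name `encode_trampoline` and its parameters are hypothetical.

```python
def encode_trampoline(target=0x808000, memsel_value=0x01):
    """Hand-assemble: LDA #imm8 / STA $420D / JML target (a sketch)."""
    code = bytearray()
    code += bytes([0xA9, memsel_value])   # LDA #$01, 8-bit immediate
    code += bytes([0x8D, 0x0D, 0x42])     # STA $420D, address is little-endian
    code += bytes([0x5C])                 # JML, followed by a 24-bit address
    code += target.to_bytes(3, "little")  # low byte, high byte, then bank
    return bytes(code)


trampoline = encode_trampoline()
print(trampoline.hex(" "), "-", len(trampoline), "bytes")
```

Nine bytes in total, so even a short run of unused zeroes is enough to hold it.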

So it worked, and now we’ve also set the fast ROM speed. But that wasn’t enough; I found that code was still executing from bank 00. What gives?

Interrupts

On the Super Nintendo, you get an interrupt every 60Hz frame. This is common for game consoles of the time, and is pretty much how they keep time. On the 65c816, interrupts work the same way they do on the 6502; you have a maskable interrupt IRQ, and the non-maskable NMI. (Maskable means the processor can internally choose to ignore the interrupt; a non-maskable one will always happen. They also execute from different addresses.) The S-PPU frame timer is NMI. There are two vectors for NMI: in the 65c816’s native mode, it’s at 00:FFEA, and in 6502 emulation mode, it’s at 00:FFFA. But note that in either case, they’re just two bytes.

Fatal Fury uses the same interrupt handler for both modes; at 00:FFFA we see a link to the bank 00 address 80EE. But what’s there?

NMI: JML.W [$0100]

A long-jump like we saw before, but this time, to a location that’s listed in RAM. This is a pretty common technique; when the game is doing different things, it might want to call a different function for the NMI. The 65c816 will always follow the same vectors from the same location, and this game maps that location to ROM, so you need a trampoline.

This is a 24-bit jump, so there are three bytes in RAM; the page byte is at address 0x0102 in RAM. I guess the developers wanted to have the possibility to store code in other banks if they needed to. Since I don’t see any cases where they do (but I didn’t disassemble the whole ROM), it seems like we just need to write 0x80 to that byte, and hope that it just stays that way. I used a breakpoint in the bsnes-plus emulator to make sure that address isn’t written to, and it never triggered after the first write, so I think we’re in business.
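In Python terms, the pointer the NMI handler jumps through looks something like the sketch below. The handler address $1234 is a made-up placeholder (I haven't listed the real values here), and `read_long_pointer` and `set_pointer_bank` are hypothetical helper names.

```python
def read_long_pointer(ram, addr):
    """Read a 24-bit little-endian pointer: low byte, high byte, bank."""
    return int.from_bytes(ram[addr:addr + 3], "little")


def set_pointer_bank(ram, addr, bank):
    """Overwrite only the bank byte (the third byte) of the pointer."""
    ram[addr + 2] = bank


ram = bytearray(0x2000)                       # stand-in for low RAM
ram[0x0100:0x0103] = bytes([0x34, 0x12, 0x00])  # placeholder handler 00:1234
print(hex(read_long_pointer(ram, 0x0100)))

set_pointer_bank(ram, 0x0100, 0x80)           # the one-byte change at $0102
print(hex(read_long_pointer(ram, 0x0100)))
```

As long as the game only ever rewrites the two low bytes, the handler keeps running from the fast mirror.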

The reset vector code is the only place that writes to that byte of RAM; specifically, it does so when it clears out all of memory with 00. That’s just good practice.

LDA.W #$0000
STA.L $7E0000
LDA.W #$FFFF        ; set up the block move
LDX.W #$0000        ; ditto
LDY.W #$0001        ; ditto
MVN $7E, $7E        ; block move; code address 00:8038

This is a bit complicated to me, who doesn’t know the 65c816 that well, but what I believe it’s doing is storing the byte 00 to the first bank of RAM (look at the memory map: the start of the RAM at bank 7e is the same “low RAM” at bank 00), and then copying that all across using a 65c816-exclusive block move opcode, MVN, which copies from a source bank to a destination bank (here both are $7E) one byte at a time. Because the destination starts one byte after the source, each byte copied is the zero that was written just before it, so that one 00 gets copied all across the first 64k of RAM.
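If that overlapping-copy trick seems too good to be true, here is a small Python simulation of it. It models MVN inside a single 64KiB bank, with A holding the byte count minus one, X the source offset and Y the destination offset, which is how I understand the instruction to work; the function name `mvn` and the bytearray standing in for bank $7E are just for illustration.

```python
def mvn(memory, a, x, y):
    """Simulate MVN within one 64 KiB bank; returns the number of bytes copied."""
    copied = 0
    while True:
        memory[y] = memory[x]      # copy one byte from source to destination
        x = (x + 1) & 0xFFFF       # both offsets count upwards and wrap
        y = (y + 1) & 0xFFFF
        a = (a - 1) & 0xFFFF       # A counts down...
        copied += 1
        if a == 0xFFFF:            # ...and the move ends when it underflows
            break
    return copied


bank = bytearray([0xFF]) * 0x10000  # pretend RAM is full of garbage
bank[0] = 0x00                      # the 16-bit STA writes two zero bytes
bank[1] = 0x00
copied = mvn(bank, a=0xFFFF, x=0x0000, y=0x0001)
print(copied, "bytes copied, all zero:", not any(bank))
```

Each step copies the zero written by the previous step, so a single seed value ends up smeared across the whole bank.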

It was easiest to find another empty block and just replace that MVN $7E, $7E with a jump. (They’re both three byte instructions)

SET_HIGH_BANK: MVN $7E,$7E
               LDA.W #$0080
               STA $0102
               LDA.W #$0000
               JMP DONE_SETTING_HIGH_BANK

So it clears the block, then writes a single byte (notice that I’m in 16-bit mode, so I have to write two bytes), then sets everything back to the prior state.

Notice I had to end in another jump, rather than an RTS return. That’s because it just cleared RAM, which includes the stack! If I tried to return, it’d have lost the return address. (Ask me how I know) Normally you’d just include the code inline, but I’m trying to move things around as little as possible since my disassembly is incomplete.

So now our code is running from bank 80, and will stay that way long enough for our test purposes. Fetching instructions should run fast. But what about the data we work with?

Changing the data bank

So, to change the page for data, there’s a special register called the data bank register. For whatever reason, the only instruction that sets it directly is PLB: pull the top value from the stack into the data bank register. (The block moves MVN and MVP also change it as a side effect.) Take this code from the disassembly:

LDA.W #$0000
PHA         
PLB          
PLB             

This loads a 0 into the A register and pushes it onto the stack, which puts two bytes there if we’re in 16-bit mode (as we are here). PLB only ever pulls a single byte, since the data bank register is always one byte, so it takes two of them to get both bytes back off the stack; as far as I can tell, the second PLB is there to keep the stack balanced, and the data bank ends up holding whichever byte was pulled last. So this can definitely get complicated. Thankfully, most of the time we’ll want a PLB preceded by a PHA, as that’s the pattern this code seems to use to set a new value.
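Here is a tiny Python model of that PHA / PLB / PLB sequence with a 16-bit accumulator, based on my understanding of the 65c816 stack (a 16-bit push stores the high byte first, and each PLB pulls exactly one byte). The function name `dbr_after_pha_plb_plb` is made up.

```python
def dbr_after_pha_plb_plb(a16):
    """Return (data bank register, leftover stack depth) after the sequence."""
    stack = []
    # PHA with a 16-bit accumulator pushes the high byte, then the low byte.
    stack.append((a16 >> 8) & 0xFF)
    stack.append(a16 & 0xFF)
    dbr = stack.pop()   # first PLB pulls the low byte
    dbr = stack.pop()   # second PLB pulls the high byte, which is what sticks
    return dbr, len(stack)


for value in (0x0000, 0x0080, 0x8000):
    dbr, depth = dbr_after_pha_plb_plb(value)
    print(f"A=${value:04X} -> data bank ${dbr:02X}, stack depth {depth}")
```

If this model is right, it matters when patching: with this particular pattern the bank comes from the high byte of A, so loading #$0080 would leave the data bank at $00, while #$8000 would give $80.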

As noted, looking at the disassembly, it looks like all of the code (that I found, anyways!) is in bank 00. So setting the data bank to 00, which is what it’s doing here, basically just means that I want to load from the code segment.

So you might think it’s just a matter of finding all PHA then PLB, then changing the value in the accumulator by adding $80. And that’s true, but with some exceptions and complications. $7F and $7E, for example, are mapped to RAM, and shouldn’t be changed, since $FE and $FF are not mapped to RAM.

However, looking at the code, while there are cases like the above, it seems that the most common setup was to use LDA.B $48,X, after putting something in the X register. This is a zero-page addressing mode; initially I tried to find the table in RAM (at $48, of course) it’s referring to, but I found it easier to write a small subroutine that loads in the bank, makes sure it’s less than $7E, and then sets the bit for the fast ROM region and sets the data bank.

  LOAD_DATA_BANK: 
                  LDA.B $48,X
                  CMP.B #$7E            ; Are we looking for RAM?
                  BCS DONE_LOAD_DATA    ; Branch if the bank is $7E or above
                  ORA.B #$80            ; Otherwise set the top bit (same as adding $80 here)
  DONE_LOAD_DATA:
                  PHA
                  PLB                   ; Set the register
                  RTS

Doing this through a subroutine makes it slightly slower, which kind of defeats the point if this is called too often.
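The bank logic of that subroutine is simple enough to check in a few lines of Python. This mirrors the compare-and-set-bit steps only (the stack push and pull are left out), and `to_fast_bank` is a name I made up for the sketch:

```python
def to_fast_bank(bank):
    """Map a data bank to its fast ROM mirror, leaving RAM banks alone."""
    if bank >= 0x7E:        # CMP #$7E / BCS: $7E, $7F and anything already high
        return bank
    return bank | 0x80      # ORA #$80: $00-$7D become $80-$FD


for bank in (0x00, 0x2C, 0x7D, 0x7E, 0x7F, 0x80):
    print(f"${bank:02X} -> ${to_fast_bank(bank):02X}")
```

Note that the unsigned compare also leaves banks that are already $80 or higher untouched, so calling it twice is harmless.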

Does it work?

So, I’m judging here by starting a match between Terry Bogard and Duck King, and seeing how long the text “WAITING” is on screen. The game is being played in both cases from a Super Everdrive, on this very pretty Super Nintendo belonging to my lovely girlfriend.

An original Super Nintendo modified with a clear replacement case

              Time (s)
Slow ROM      1.54
Fast ROM      1.38
Time saved    0.16

That’s right! Through all this effort we’ve gained a massive 10% decrease in loading time. It’s not the full speedup, of course, because any code touching RAM or the registers will run at the same speed. But still, isn’t that nice? It’s worth noting that this method of measurement is only accurate to within a frame. The SNES runs at about 60fps, so 0.16s is about ten frames of difference.

Fatal Fury win screen: Terry Bogard throws his hat

Well okay, honestly, it’s not huge in reality– a tenth of a second probably wouldn’t be noticed. Which is why Takara didn’t bother buying Fast ROM from Nintendo. And why doesn’t it matter? Remember, the code is most likely decompressing graphics into RAM and then copying them into video RAM using DMA. Which is to say, all those ROR will run a bit faster, but everything else can’t speed up.
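We can even put a rough number on "everything else". The back-of-the-envelope Python below assumes the entire saving comes from ROM accesses dropping from 8 master cycles to 6, and ignores DMA, measurement error and anything else going on; under those assumptions it estimates what share of the original wait was spent on ROM accesses. The name `rom_bound_fraction` is mine.

```python
def rom_bound_fraction(slow_s, fast_s, slow_cycles=8, fast_cycles=6):
    """Estimate the share of the slow run spent on ROM accesses (a sketch)."""
    saved_ratio = 1 - fast_s / slow_s                 # overall time saved
    per_access_saving = 1 - fast_cycles / slow_cycles  # 25% saved per ROM access
    return saved_ratio / per_access_saving


fraction = rom_bound_fraction(1.54, 1.38)
print(f"about {fraction:.0%} of the wait was ROM access")
print(f"frames saved at 60fps: about {0.16 * 60:.1f}")
```

By this estimate well under half of the waiting time was ever eligible for the speedup, which fits with most of the loop hammering on RAM.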

So what’s the difference between Fatal Fury and Super Castlevania IV? In Super Castlevania IV, the problem being solved was slowdown. A 10% speedup that causes you to not overrun a frame makes all the difference. In Fatal Fury, though, we’re just looking at loading screens. Probably the best way to speed this up would be to use a bigger ROM and store uncompressed data; I’ll leave that to someone else. Here’s an IPS patch you can apply if you want to try my changes for yourself.

Fatal Fury post-win screen: Terry Bogard says 'Did you think you could beat me? Go home!'

The dog still wasn’t very shaggy

So, while I started with the Japanese version of Garou Densetsu I happened to have on my dresser, I immediately jumped to the US release of Fatal Fury. So, why is that?

Fatal Fury Japanese title screen

Well, there are quite a few regional differences between Garou Densetsu and Fatal Fury. The pre-battle portraits are different. The Japanese version has no title screen scroll about the King of Fighters tournament; the US one has no blue Takara logo screen. But one notable one is that in the Japanese version, the “WAITING” screen doesn’t exist at all. It still takes the same amount of time to load; it just doesn’t give you any feedback while it does so.