The Super Nintendo Entertainment System was, in many ways, the most advanced game console of its era, featuring 128kiB of work RAM, tilemap scaling and rotation, a massive color palette of 15-bit colors, translucency, flexible graphics modes, DMA, HDMA, and many other acronyms. But SNES gamers also know that it had an Achilles’ heel: speed. The pokey S-CPU wasn’t always the fastest kid on the block, and with SlowROM, it can be even slower than that. Let’s take a look at one particular game that runs slowly, Fatal Fury, and see if we can speed it up a tad.
To give this story some “human interest”, for the past few weeks, I’ve seen this sitting on my dresser. A copy of Garou Densetsu for the Super Nintendo. I found it at a thrift store alongside some jewelry; because I wanted the jewelry on my dresser, I never bothered to move the game.
The Super Nintendo port of Fatal Fury, as it’s known in the west, was a fairly successful game as far as I can tell, but also felt like an also-ran compared to the juggernaut that is Street Fighter II. Getting rid of many of Fatal Fury’s unique elements like the 2-on-1 battles, the arm wrestling, and two-line combat, the Super Nintendo version moved closer to its Capcom counterpart, though it still maintained the SNK game’s greater focus on story, along with its limited number of characters.
Still, I’m a big fan of SNK’s works, and the SNES port is still pretty solid; I imagine it was a pretty fun rental back in the 90’s. But there’s one thing about the game that’s always bothered me. Before a match, you see a screen like this:
A loading screen! Remember, the Super Nintendo never got a CD attachment; this is a ROM chip hooked up directly to the Super Nintendo. So why does it have this loading screen? And what can we do about it?
Tracing it out
First off, what’s going on during the waiting screen? For this we turn to a technique we also used back in the ALF saga: trace logging. The Super Nintendo uses a WDC 65C816 CPU, a derivative of the famous MOS 6502, but other than that the concept is the same. As a single-threaded CPU, it runs one instruction at a time; we log that instruction.
Mind you, trace logs are huge. A log just through the waiting period is a massive 272MiB.
I expected most of the time to be spent copying to VRAM (writing to $2118; this page is a must read), and while that did happen a lot, the vast majority of time is spent on this loop:
00871a ldx #$0000 A:0055 X:0004 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC V:117 H: 682 I:0
00871d ror $17  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC V:117 H: 706 I:0
00871f ror $16  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc V:117 H: 744 I:0
008721 ror $15  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZc V:117 H: 782 I:0
008723 ror $14  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZc V:117 H: 820 I:0
008725 ror $13  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 nvMxdIZC V:117 H: 858 I:0
008727 ror $12  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC V:117 H: 896 I:0
008729 ror $11  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC V:117 H: 934 I:0
00872b ror $10  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC V:117 H: 972 I:0
00872d bcs $8738  A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc V:117 H:1010 I:0
00872f stz $0c20,x [000c20] A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc V:117 H:1026 I:0
008732 inx A:0055 X:0000 Y:0009 S:01f2 D:0000 B:00 NvMxdIzc V:117 H:1064 I:0
008733 dec $0d [00000d] A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc V:117 H:1078 I:0
008735 bne $871d [00871d] A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc V:117 H:1116 I:0
00871d ror $17  A:0055 X:0001 Y:0009 S:01f2 D:0000 B:00 nvMxdIzc V:117 H:1138 I:0
It’s a bit more complicated than this, but what it’s doing over and over again is using the ROR opcode on the eight byte values between $00:0010 and $00:0017. These are in RAM; the memory map of the SNES is a bit complicated. To make a long story short, the CPU still bears a lot of resemblance to the 65c02, with its 16-bit addresses, but it can access more memory than that. Special registers determine the “page”, which is the top byte, of a 24-bit address space. The SNES maps things into memory differently depending on which page is being accessed.
In what’s good news for Nicole Express, who doesn’t know the 65c816 very well, the M flag is set, so this is running in 8-bit mode. Specifically this is an 8-bit rotate, using the carry to spread across eight bytes, and then branching to $8738 if carry is set. And what’s there? That’s another loop condition, increasing x, and storing data in $0c20,x, an address in RAM.
008738 lda [$00],y [2cfdd7] A:0055 X:0008 Y:0009 S:01f2 D:0000 B:00 NvMxdIzC V:120 H: 476 I:0
00873a sta $0c20,x [000c28] A:0003 X:0008 Y:0009 S:01f2 D:0000 B:00 nvMxdIzC V:120 H: 524 I:0
00873d iny A:0003 X:0008 Y:0009 S:01f2 D:0000 B:00 nvMxdIzC V:120 H: 602 I:0
00873e inx A:0003 X:0008 Y:000a S:01f2 D:0000 B:00 nvMxdIzC V:120 H: 616 I:0
00873f dec $0d [00000d] A:0003 X:0009 Y:000a S:01f2 D:0000 B:00 nvMxdIzC V:120 H: 630 I:0
008741 bne $871d [00871d] A:0003 X:0009 Y:000a S:01f2 D:0000 B:00 nvMxdIzC V:120 H: 668 I:0
My best guess is that this is some kind of decompression building up data starting in 00:0c20. These decompressed blocks are moved into the 128kiB of WRAM that the SNES has in pages $7e and $7f, and then are transferred to VRAM using the Super Nintendo’s Direct Memory Access, which you can think of as being similar to the special transfer instructions on the PC Engine’s processor.
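To make that guess a bit more concrete, here is a rough Python model of what the two traces appear to do. It is only a sketch under my own assumptions: the function names are made up, I'm treating the eight bytes at $10-$17 as a bit mask, $0d as the output count, and [$00],y as the source of literal bytes. The real routine surely does more than this.

```python
def ror_chain(mask, carry):
    """Model ROR $17 ... ROR $10: eight 8-bit rotates chained through carry.

    mask[0] stands in for $10 and mask[7] for $17 (which is rotated first).
    """
    mask = list(mask)
    for i in range(7, -1, -1):
        carry_out = mask[i] & 1                  # bit 0 falls into carry
        mask[i] = (mask[i] >> 1) | (carry << 7)  # old carry enters bit 7
        carry = carry_out
    return mask, carry


def expand(mask, literals, count, carry=0):
    """Emit `count` bytes: a literal when carry is set, a zero otherwise."""
    source = iter(literals)
    out = bytearray()
    for _ in range(count):                        # count plays the role of $0d
        mask, carry = ror_chain(mask, carry)
        out.append(next(source) if carry else 0)  # lda [$00],y versus stz
    return bytes(out)


# Mask bits 0 and 2 are set: outputs 0 and 2 are literals, the rest are zeros.
print(expand([0b101, 0, 0, 0, 0, 0, 0, 0], b"\xaa\xbb", 4).hex())  # aa00bb00
```

If this reading is right, long runs of zero bytes cost only one mask bit each, which would suit graphics with lots of empty space.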
So it’s pretty much what you might have guessed. Fighting games require a lot of graphics for large, detailed characters, and corporations want to sell smaller ROMs, so compression is unsurprising. Now, why is Fatal Fury so much slower than, say, Street Fighter II? Even the weird lock-on technology version I have?
Eh, who knows. The game might just be less efficiently coded, or the compression is more aggressive– Fatal Fury is a smaller ROM than Street Fighter II. Programming is all about tradeoffs.
Well, the easiest fix is to run Fatal Fury on the Neo Geo.
Okay, okay, that was rude of me. But, why is Fatal Fury faster on the Neo Geo?
- No decompression needed; graphics are in ROMs right on the video bus, like the NES games that used CHR-ROM.
- Writes to video RAM can be done at any time. The SNES has to wait for blanking periods.
- If it needs to do heavy data processing, the Neo Geo has a 12MHz Motorola 68000 CPU, while the SNES has a… how fast is the S-CPU anyway?
A quick look at Wikipedia will tell you that the S-CPU, a Ricoh 5A22 based on a WDC 65c816 core, runs at 3.58MHz. But that’s not quite true. Much like your internet speeds, the Ricoh 5A22 runs at up to 3.58MHz. It can also run as slow as 1.8MHz! This is a variable-speed bus, which is actually an incredibly common technique. The speed depends on the address being read.
Specifically, the areas are:
- 3.58MHz: Hardware registers, fast ROM
- 2.68MHz: Slow ROM, work RAM
- 1.79MHz: Accessing certain peripherals (the controller)
I have to admit, while the most reliable sources, including the source code for the well-regarded higan emulator, all say the speed drops for the RAM, I’m a bit surprised– usually, it’s easier to get faster-running RAM. I’m also not sure why the controller is so much slower; perhaps the original backwards-compatibility plans with the NES required a slow controller port that could stay in line with the shift registers in early NES controllers? I’m just speculating here.
What we actually care about here is the difference between slow ROM and fast ROM. See, this depends on what’s in the cartridge. (Of course, what’s in the cartridge was determined by Nintendo, who manufactured all licensed SNES games. And they charged dearly for the difference. Given how NEC/Hudson were selling HuCards to match with a 7.1MHz 65C02 CPU, I have to wonder if they were charging too much, but this isn’t a review of Nintendo’s business practices) In any case, it’s no surprise that Fatal Fury used slow ROM– which means that the CPU is very rarely actually running at that 3.58MHz.
So, would the 33% increase in CPU bus speed (only when accessing ROM) actually matter for us? Well, there’s no way to find out but to do it.
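For the curious, all three speeds fall out of dividing one master clock. Here's a quick sketch of the arithmetic, assuming the NTSC master clock of roughly 21.477MHz and dividers of 6, 8 and 12; the constant and function names are just mine.

```python
MASTER_CLOCK_HZ = 21_477_272  # NTSC SNES master clock, about 21.477 MHz

# Master clocks consumed by one CPU cycle in each kind of region
DIVIDERS = {"fast": 6, "slow": 8, "extra slow": 12}


def cpu_mhz(divider):
    """Effective CPU speed in MHz for a given master-clock divider."""
    return MASTER_CLOCK_HZ / divider / 1_000_000


for name, divider in DIVIDERS.items():
    print(f"{name}: {cpu_mhz(divider):.2f} MHz")

# Going from 8 master clocks per access to 6 is where the 33% comes from.
speedup = DIVIDERS["slow"] / DIVIDERS["fast"] - 1
print(f"slow to fast: {speedup:.0%} faster")
```

Note that the 33% only applies to the individual ROM accesses; anything touching RAM still pays 8 master clocks.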
Converting a game to use FastROM isn’t my idea, it’s been done quite a few times. Most famously, Vitor Vilela, known for his work converting games to use the SA-1 coprocessor, has converted several games to fast ROM, including Super Castlevania IV, seeing impressive results reducing slowdown. Check that out for a picture of what proper patch code might look like, and check out kandowontu for even more FastROM games. Keep reading this blog post for a careful explanation of a hackjob.
Converting to FastROM
So, you want to convert a slow ROM game to fast ROM? It’s 2023, after all, and if your flash cart can run Donkey Kong Country (legitimately dumped from a cartridge you personally own, of course) it can keep up with that blazing 3.58MHz clock. You’d hope it’d just be a matter of going to the ROM header and changing a bit, but it’s not that easy. The ROM header on the Super Nintendo doesn’t actually do anything– it was a Nintendo requirement, and emulators may use it to get their bearings, but the console itself never reads it. (Unless the game tells it to for some reason, but then it’s just reading it like any other data in the ROM)
So you look a little deeper, and see the register $420d. This does determine the speed of the bus when accessing ROM, so we’re definitely a bit closer. But there’s a catch: it only applies to pages $80-$ff. These are “mirrors” of the ROM pages in $00-$7d ($7e and $7f are RAM); the data comes from the same place in ROM. But if you access them using their lower addresses, you’ll get the slower bus speed no matter what’s been sent to $420d. And pretty much every slow ROM game uses the low-numbered pages, including Fatal Fury.
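Here's how I understand the rule, written out as a small Python function. Treat it as a sketch of the commonly documented timing map rather than gospel; the function name and the `fastrom_enabled` flag (standing in for the bit written to $420d) are my own.

```python
def access_clocks(address, fastrom_enabled):
    """Master clocks for one CPU access to a 24-bit address."""
    bank = (address >> 16) & 0xFF
    offset = address & 0xFFFF
    if (bank & 0x40) or offset >= 0x8000:
        # Cartridge ROM area, plus work RAM in banks $7e-$7f.
        # Only the high mirrors ($80 and up) can ever be fast.
        return 6 if (bank >= 0x80 and fastrom_enabled) else 8
    if offset < 0x2000 or offset >= 0x6000:
        return 8    # low RAM mirror and the expansion area
    if offset < 0x4000 or offset >= 0x4200:
        return 6    # PPU and CPU hardware registers
    return 12       # $4000-$41ff: the old-style controller ports


print(access_clocks(0x008000, True))   # low mirror: slow regardless of $420d
print(access_clocks(0x808000, True))   # high mirror with the bit set: fast
```

The first call is the important one: setting the register alone changes nothing for code that keeps running out of bank 00.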
So what do you need to do to convert to fast ROM?
- Update the header (it’s just polite)
- Make every single ROM access do so through the higher-numbered pages: both the data bank, and code.
That is to say, to move from slow ROM to fast ROM, you actually need to change the code. It’s time to buckle up and do some disassembly.
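The header part, at least, is easy. Here's a minimal sketch of the polite bit, assuming a LoROM image with no copier header, so the map mode byte lives at file offset 0x7FD5 with bit 4 as the FastROM flag. The names are hypothetical, and this isn't the tool I actually used.

```python
LOROM_MAP_MODE_OFFSET = 0x7FD5  # assumption: LoROM image, no 512-byte copier header
FASTROM_BIT = 0x10              # bit 4 of the map mode byte


def mark_header_fast(rom):
    """Return a copy of the ROM image with the FastROM bit set in the header."""
    patched = bytearray(rom)
    patched[LOROM_MAP_MODE_OFFSET] |= FASTROM_BIT
    return bytes(patched)


# Tiny synthetic image: one 32kiB bank with a slow LoROM map mode byte ($20).
image = bytearray(0x8000)
image[LOROM_MAP_MODE_OFFSET] = 0x20
print(hex(mark_header_fast(image)[LOROM_MAP_MODE_OFFSET]))  # 0x30
```

Changing any byte also means the checksum stored in the header no longer matches unless you recompute it; the console doesn't care, but some emulators and tools will complain.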
Disassembly on the Super Nintendo
For disassembly, I used DiztinGUIsh, a disassembler designed to go with a special build of the BSNES-plus debugger-equipped emulator. See, disassembling a Super Nintendo game is a little more complicated than, say, the Z80 code needed to port a game from the Game Gear to the Master System. And it’s all because of the 65c816 CPU. But why does it need to be linked with the emulator?
You might be aware that even today, Intel and AMD x86-64 CPUs boot into what’s called “real mode”, a 16-bit mode that’s compatible with code dating back to the 1978 Intel 8086. (Well, er, close enough for our purposes, anyway) The 65c816 is very similar; it was intended to boot into an 8-bit mode compatible with the classic MOS 6502. That’s what allowed the Apple IIgs to run original Apple ][ software, and over here in the video game console world, allowed Super Mario All-Stars to share a lot of code with the NES releases.
On the Z80 or 6502 or 8086, a lot of the difficulty of disassembling is distinguishing code from data, and lining up where the code starts. Instructions on these processors are variable-length; a 6502
NOP is a single byte, while a jump to a fixed location,
JMP $AABB, is three bytes; so you need to know where to start reading code from. Disassembling data, on the other hand, will just give you irrelevant nonsense. (It’s always possible for assembly code to do all sorts of fancy nonsense complexifying this nice picture, but it’s good enough for most code)
All of this is also true on the 65c816. But there’s an added complexity. An instruction’s length can also change depending on whether the processor is in 16-bit mode or 8-bit mode. And the mode could be changed at any point in the program; it’s a global state. And unlike your average x86-64 program, 65c816 programs change mode all the time. Even knowing a certain block beginning at a certain address is code isn’t enough; you need to know what mode it’s expecting to run in.
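A toy example might help here. The sketch below splits the same four bytes into instructions twice, once assuming an 8-bit accumulator and once assuming a 16-bit one. It only knows a handful of opcodes and doesn't follow mode changes at all; the tables and function name are mine, purely for illustration.

```python
# Immediate-operand opcodes sized by the accumulator width (M flag), e.g. LDA #
M_SIZED = {0x09, 0x29, 0x49, 0x69, 0x89, 0xA9, 0xC9, 0xE9}
# Immediate-operand opcodes sized by the index width (X flag), e.g. LDX #
X_SIZED = {0xA0, 0xA2, 0xC0, 0xE0}
# A few fixed-length opcodes: NOP, CLC, STA absolute, JML long
FIXED = {0xEA: 1, 0x18: 1, 0x8D: 3, 0x5C: 4}


def split_instructions(code, m8, x8):
    """Split raw bytes into instructions for a fixed M/X flag state."""
    chunks, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if op in M_SIZED:
            size = 2 if m8 else 3
        elif op in X_SIZED:
            size = 2 if x8 else 3
        else:
            size = FIXED[op]  # raises KeyError for opcodes this toy doesn't know
        chunks.append(code[pc:pc + size].hex())
        pc += size
    return chunks


code = bytes([0xA9, 0x01, 0xEA, 0xEA])
print(split_instructions(code, m8=True, x8=True))   # ['a901', 'ea', 'ea']
print(split_instructions(code, m8=False, x8=True))  # ['a901ea', 'ea']
```

In 8-bit mode that's a load followed by two NOPs; in 16-bit mode the first NOP gets swallowed as the high byte of the operand, and everything after it is misaligned.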
A way around this is linking with an emulator playing the game in real time; that way, you let the disassembler know what state the processor is in, and how to proceed. The downside is, you actually need to play the game. This is basically the same thing as the trace log above. I’ve never done this before, but the tutorial on DiztinGUIsh’s GitHub seemed pretty straightforward.
We don’t actually need to play that much of the game. This is mostly a demonstration piece, so I really only need to play through the loading part. But this is good to know if you want to do this yourself, and do a good job: you’ll need to be able to play through as much of the game as you can, to trigger as many weird code cases as possible.
Rewriting to use FastROM
Changing the code bank
One thing my trace showed is that all the code in Fatal Fury that I found is in bank 00. This is great, because it means that once we start executing code, we don’t need to worry about changing banks. But remember, we’re in bank 00; this means we’re in the slow part of the memory map. And how did we get to bank 00? We need to learn a bit more about the 65c816.
The 65c816 always boots up in 8-bit mode. Specifically, exactly like the 6502 with which it is backwards compatible, it boots up and reads the “reset vector”, a two-byte address located at address 00:FFFC. It then jumps to that two-byte address; since it’s a two-byte address, it can’t change the bank, but that’s not a huge deal. What we need is a trampoline function; something that will sit between the reset vector and the startup code it points to (address 8000), and jump to the fast ROM version of that code located in bank 80.
And, when we’re writing our trampoline, we might as well also write $01 to that register at $420d, and get that out of the way.
I found a row of zeroes (which hopefully is unused) in bank 00 just above 00:FD6A, and wrote the following:
RESET_VECTOR: LDA.B #$01
LDA.B and STA.B are 8-bit (“byte”) versions of the load and store instructions; since we’re in 8-bit mode when the processor starts up, we know those will be the correct ones to use.
JML is the 24-bit version of the jump instruction; you might sometimes see it as JMP.L, but the Asar assembler DiztinGUIsh uses wants it this way. Notice that even if we’re in 8-bit mode, we can still do 65c816-specific things.
So it worked, and now we’ve also set the fast ROM speed. But that wasn’t enough; I found that code was still executing from bank 00. What gives?
On the Super Nintendo, you get an interrupt every 60Hz frame. This is common for game consoles of the time, and is pretty much how they keep time. On the 65c816, interrupts work the same way they do on the 6502; you have a maskable interrupt IRQ, and the non-maskable NMI. (Maskable means whether the processor can internally choose to ignore the interrupt, or whether it will always happen– they also execute from different addresses.) The S-PPU frame timer is NMI. There are two vectors for NMI; in the 65c816’s native (16-bit capable) mode, it’s at 00:FFEA, and in the 6502-compatible 8-bit mode it boots into, it’s at 00:FFFA. But note that in either case, they’re just two bytes.
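If you want to peek at these vectors without firing up a disassembler, something like the sketch below works. It assumes a LoROM image with no copier header, so that 00:8000-FFFF is simply the first 32kiB of the file; the dictionary and function names are mine, and the test image is synthetic rather than the real ROM.

```python
import struct

# Bank 00 addresses of the vectors discussed here
VECTORS = {"native NMI": 0xFFEA, "emulation NMI": 0xFFFA, "reset": 0xFFFC}


def read_vectors(rom):
    """Read the 16-bit little-endian vectors out of a LoROM image."""
    result = {}
    for name, address in VECTORS.items():
        # Assumption: 00:8000-FFFF maps to file offsets 0x0000-0x7FFF
        (target,) = struct.unpack_from("<H", rom, address - 0x8000)
        result[name] = target
    return result


# Synthetic 32kiB bank with the 8-bit-mode NMI vector pointing at $80EE
image = bytearray(0x8000)
image[0x7FFA:0x7FFC] = bytes([0xEE, 0x80])
print(hex(read_vectors(image)["emulation NMI"]))  # 0x80ee
```

Each entry is only sixteen bits, which is why the bank can't be changed here and a trampoline is needed.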
Fatal Fury uses the same interrupts for 8-bit and 16-bit mode; at 00:FFFA we see a link to the bank 00 address 80EE. But what’s there?
NMI: JML.W [$0100]
A long-jump like we saw before, but this time, to a location that’s listed in RAM. This is a pretty common technique; when the game is doing different things, it might want to call a different function for the NMI. The 65c816 will always follow the same vectors from the same location, and this game maps that location to ROM, so you need a trampoline.
This is a 24-bit jump, so there are three bytes in RAM; the page is in address 0x0102 in RAM. I guess the developers wanted to have the possibility to store code in other banks if they needed to. Since I don’t see any cases where they do (but I didn’t disassemble the whole ROM), it seems like we just need to write 0x80 to that byte, and hope that it just stays that way. I used a breakpoint in the bsnes-plus emulator to make sure that address isn’t written to, and it never triggered after the first write, so I think we’re in business.
The reset vector code is the only place that writes to that byte of RAM; specifically, it does so when it clears out all of memory with 00. That’s just good practice.
LDA.W #$FFFF ; set up the block move
LDX.W #$0000 ; ditto
LDY.W #$0001 ; ditto
MVN $7E, $7E ; block move; code address 00:8038
This is a bit complicated for me, since I don’t know the 65c816 that well, but what I believe it’s doing is storing the byte 00 to the first bank of RAM (look at the memory map: the top of the RAM at bank 7e is the same “low RAM” at bank 00), and then copying that all across using a 65c816-exclusive block move opcode, MVN, which copies from one bank to the next one byte at a time. So it just copies that one 00 all across the first 64k of RAM.
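The overlapping-copy trick is easier to believe when you watch it run. Here's a small Python model of MVN with the register values from the listing above (A=$FFFF, X=$0000, Y=$0001), assuming the source and destination are the same 64k bank. The function is my own simplification and ignores everything else the real opcode does.

```python
def mvn(bank, a, x, y):
    """Model MVN inside one bank: copy a+1 bytes from x to y, one at a time."""
    while True:
        bank[y] = bank[x]
        x = (x + 1) & 0xFFFF
        y = (y + 1) & 0xFFFF
        a = (a - 1) & 0xFFFF
        if a == 0xFFFF:        # the counter wraps once every byte has moved
            return a, x, y


ram = bytearray([0xAA]) * 0x10000  # pretend the bank is full of junk
ram[0] = 0x00                      # the single zero stored beforehand
mvn(ram, a=0xFFFF, x=0x0000, y=0x0001)
print(ram.count(0) == len(ram))    # True
```

Because the destination is always one byte ahead of the source, each byte copied is the zero that was written on the previous step, so the one zero smears across the whole bank.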
It was easiest to find another empty block and just replace that MVN $7E, $7E with a jump. (They’re both three-byte instructions.)
SET_HIGH_BANK: MVN $7E,$7E
So it clears the block, then writes a single byte (notice that I’m in 16-bit mode, so I have to write two bytes), then sets everything back to the prior state.
Notice I had to end in another jump, rather than an RTS return. That’s because it just cleared RAM, which includes the stack! If I tried to return, it’d have lost the return address. (Ask me how I know) Normally you’d just include the code inline, but I’m trying to move things around as little as possible since my disassembly is incomplete.
So now our code is running from bank 80, and will stay that way long enough for our test purposes. Fetching instructions should run fast. But what about the data we work with?
Changing the data bank
So, to change the page for data, there’s a special register called the data bank register. For whatever reason, the only way to change it is a PLB; pull the top value from the stack into the data bank register. Take this code from the disassembly:
This loads a 0 into the A register, pushes it onto the stack, which will put two bytes on if we’re in 16-bit mode (as we are here), then pulls the two bytes back into the data bank register, which is always one byte. However, in this case, the next PLB is actually just pulling a register value that was already on the stack back into it. So this can definitely get complicated. Thankfully, most of the time we’ll want PLB preceded by PHA, as that’s the pattern this code seems to use to set a new value.
As noted, looking at the disassembly, it looks like all of the code (that I found, anyways!) is in bank 00. So setting the data bank to 00, which is what it’s doing here, basically just means that I want to load from the code segment.
So you might think it’s just a matter of finding all PLB, then changing the value in the accumulator by adding $80. And that’s true, but with some complicating exceptions. $7E and $7F, for example, are mapped to RAM, and shouldn’t be changed– $FE and $FF are not mapped to RAM.
However, looking at the code, while there are cases like the above, it seems that the most common setup was to use LDA.B $48,X, after putting something in the X register. This is a zero-page addressing mode; initially I tried to find the table in RAM (at $48, of course) it’s referring to, but I found it easier to write a small subroutine that loads in the bank, makes sure it’s less than $7E, and then sets the bit for the fast ROM region and sets the data bank.
CMP.B #$7E ; Are we looking for RAM?
BCS DONE_LOAD_DATA ; Branch if the bank is $7E or above
ORA #$80 ; Otherwise add $80
PLB ; Set the register
Doing this through a subroutine makes it slightly slower, which kind of defeats the point if this is called too often.
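The decision the subroutine makes is small enough to write as a one-line rule. Here it is as a Python function, purely as a model of the compare-and-set-bit logic; the name is hypothetical.

```python
def fast_data_bank(bank):
    """Map a data bank to its FastROM mirror, leaving RAM and high banks alone."""
    if bank >= 0x7E:       # CMP #$7E / BCS: RAM banks, or already in the high half
        return bank
    return bank | 0x80     # ORA #$80: same ROM data, seen through the fast mirror


for bank in (0x00, 0x2C, 0x7E, 0x7F, 0x80):
    print(f"${bank:02X} -> ${fast_data_bank(bank):02X}")
```

Banks that are already $80 or higher pass the same comparison as the RAM banks, so they come through untouched as well.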
Does it work?
So, I’m judging here by starting a match between Terry Bogard and Duck King, and seeing how long the text “WAITING” is on screen. The game is being played in both cases from a Super Everdrive, on this very pretty Super Nintendo belonging to my lovely girlfriend.
That’s right! Through all this effort we’ve gained a massive 10% decrease in loading time. It’s not the full speedup, of course, because any code touching RAM or the registers will run at the same speed. But still, isn’t that nice? It’s worth noting that this method of measurement is only sensitive up to a frame. The SNES runs at about 60fps, so this is about ten frames of difference.
Well okay, honestly, it’s not huge in reality– a tenth of a second probably wouldn’t be noticed. Which is why Takara didn’t bother buying Fast ROM from Nintendo. And why doesn’t it matter? Remember, the code is most likely decompressing graphics into RAM and then copying them into video RAM using DMA. Which is to say, all those ROR will run a bit faster, but everything else can’t speed up.
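A back-of-the-envelope check on a single ROR from the hot loop lands in the same ballpark. The sketch below assumes the usual five-cycle breakdown for an 8-bit ROR on a direct-page address: two bytes fetched from ROM, a read and a write to work RAM, and one internal cycle. The constants and function name are mine.

```python
INTERNAL = 6  # master clocks for a cycle that touches no memory
WRAM = 8      # master clocks per work RAM access


def ror_direct_page_clocks(rom_clocks):
    """Master clocks for one 8-bit ROR on a direct-page address in RAM."""
    fetch = 2 * rom_clocks                   # opcode byte + operand byte, from ROM
    read_modify_write = WRAM + INTERNAL + WRAM
    return fetch + read_modify_write


slow = ror_direct_page_clocks(8)
fast = ror_direct_page_clocks(6)
print(slow, fast, f"{1 - fast / slow:.1%}")  # 38 34 10.5%
```

The 38 matches the gap between consecutive ROR lines in the trace log (the H counter goes 706, 744, 782...), and only the two ROM fetches get any cheaper, which is how a 33% faster bus shrinks to roughly 10% here.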
So what’s the difference between Fatal Fury and Super Castlevania IV? In Super Castlevania IV, the problem being solved was slowdown. A 10% speedup that causes you to not overrun a frame makes all the difference. In Fatal Fury, though, we’re just looking at loading screens. Probably the best way to speed this up would be to use a bigger ROM and store uncompressed data; I’ll leave that to someone else. Here’s an IPS patch you can apply if you want to try my changes for yourself.
The dog still wasn’t very shaggy
So, while I started with the Japanese version of Garou Densetsu I happened to have on my dresser, I immediately jumped to the US release of Fatal Fury. So, why is that?
Well, there are quite a few regional differences between Garou Densetsu and Fatal Fury. The pre-battle portraits are different. The Japanese version has no title screen scroll about the King of Fighters tournament; the US one has no blue Takara logo screen. But one notable difference is that in the Japanese version, the “WAITING” screen doesn’t exist at all. It still takes the same amount of time to load; it just doesn’t give you any feedback while it does so.