A 6502 Emulator
Introduction⌗
6502 is a 8-bit processor with a 16-bit address bus. The production launched in 1975, and the processor’s clock ran at around 1-3 MHz. I like 6502 and I think that it’s a notable processor & architecture for many reasons:
- 6502 is simple. The base instruction set is just 56 instructions. The processor has a single accumulator,
A
, and two index registers -X
andY
. It also has a stack pointer and an instruction pointer, but these are transparent to the programmer. - 6502 is cheap. The processor had just around 3'500 transistors. Consequently, when it was launched, 6502 was the least expensive microprocessor on the market. It initially sold for <~ 1/6 the cost of competing CPUs - for instance, Intel 8080.
- 6502 is historically relevant. Its introduction caused rapid decreases in pricing across the processor market. Along with the Zilog Z80, it sparked a series of projects that resulted in the home computer revolution of the early 1980s.
- 6502 is relevant even nowadays! Because of it’s simplicity, cost, and power consumption, CMOS form of 6502 (65C02 developed by WDC), the 6502 family is widely used in embedded systems, with estimated production volumes in the hundreds of millions. 6502 shares this fate with 8051.
But.. why are these still relevant? Isn’t 6502 and 8051 too old? Well. Because they’re cheap, because they’re no longer covered by Intel’s and MOS Technology patents, because what’s referred to today as “8051” or “6502” isn’t a replica of what Intel and MOS Technology made in the 1970s and 1980s, but rather an ecosystem of software-compatible derivatives: same instruction set, same CPU registers, same memory architecture, and the list goes on and on. The 6502 and 8051 design has been refined by many silicon vendors over the decades who sell 8051-based microcontroller products, and because of that, these old 8-bit designs still have usefulness and relevance in the today’s market. They’re used in industrial control systems (industry hates changing “working” solutions - we covered that earlier), high-performance(!) devices, low power devices (c8051f98x) and IP cores of FPGAs. Actually, Furby runs 6502, just like many MP3 players, mouses, keyboards and monitors. Everything.
A deep dive⌗
Let’s set the goals first. With our emulator, we will want to achieve the following:
- small, portable code.
- decimal mode implemented.
- 100% coverage of legal opcodes.
- emulating hardware glitches.
What isn’t the goal:
- cycle accurancy
- performance
6502 supports four kinds of interrupts - the one raised by the BRK
instruction, IRQ, NMI and a RESET interrupt. 6502 also has a stack. Because of the memory addressing limitations, only 65K of memory is needed. Knowing these, let’s start our emulator.
The vectors below define where is the code handling some interrupt. We’ll keep these values, but if you want to build your own 6502-centered machine, you can tweak these to your liking.
Generally speaking, since 6502 programs do a lot of 8-bit loads, we will keep our memory as a pointer to a vector of 8-bit integers. The program counter (or instruction pointer) is 16-bit, and we have 4 another 8-bit variables - A
, X
, Y
and SP
. A sharp eye will notice that this limits the stack to 256 bytes, but this isn’t a huge problem.
6502 can be considered a predecessor of RISC architectures. It’s well known for having just a single accumulator and a large amount of addressing modes. Let’s look at these first:
-
Accumulator: the operation works on the accumulator. For example -
ASL A / ROL A / ROR A
(/
representing newline). A lot of assemblers allow omitting the explicitA
, soASL / ROL / ROR
would compile too. -
Absolute: data is accessed using a constant, 16-bit address. For example -
LDA $0B26 / STY $1FE7
. -
Absolute with X or Y: data is accessed using a constant, 16-bit value added to the
X
orY
register. For example -STA $3FFF, X
stores the accumulator at$3FFF + X
, andSTA $3FFF, Y
stores the accumulator at$3FFF + Y
. -
Immediate: the operand is taken from the byte following the opcode. For example -
LDA #$FF
-
Implied: the instruction takes no arguments. Similar to the accumulator addressing mode. For example -
RTS
,RTI
. -
Indirect: data is accessed by dereferencing a constant pointer. For example -
JMP ($B800)
jumps to the location pointed by a word at$B800
(lower byte$B800
, higher byte$B801
). -
Indirect with X: An 8-bit ZP address and the X register are added. If the addition overflows, the address wraps around within ZP. The resulting address is used as a pointer to the data being accessed. This makes the X register an index into a list of pointers. For example -
LDA ($05, X)
. -
Indirect with Y: An 8-bit address identifies a pointer. The value of the Y register is added to the address contained in the pointer. Effectively, the pointer is the base address and the Y register is an index past that base address. There’s a (subtle) difference between this and indirect with X addressing mode. For example -
LDA ($05), Y
. -
Relative: An 8-bit signed value is added to the current instruction pointer. The branch target must be in the range of
(-128, 127)
bytes of the current value. For example -BEQ $0300
. -
ZP (zero page): An 8-bit address is used within the zero page (the area of memory from offset
$00
addressable using a single byte). For example -ORA $F0
. -
ZP with X or Y: analogical to the absolute with X or Y addressing mode.
Now, since 6502 has quite a bit of valid opcodes, I decided that we’ll use a bunch of macros to make the code cleaner:
And the explanations:
R
is a shorthand forreturn
.ZFN
is a zero argument function.MFN
is a monadic (one argument) function.DFN
is a dyadic (two argument) function.
When working with packed flags, we’ll need a macro to elegantly unpack n-th bit from a bit vector. We’ll call it SHF
, a shorthand for SHift Flags.
As you have seen above, the addressing modes are quite repetitive, so we will add macros that create register & memory addressing modes:
REGO
is a shorthand for register operand and MEMO
is a shorthand for memory operand. We assume that both are monadic functions. Functions with register operands will have their name end with R
, while functions with memory operands will have their name end with A
.
The memory operand macro is using R8
and W8
functions - respectively, fetch (read) a byte (8-bit value) from memory and write a byte to memory.
Finally, we will define a macro that will be used in the instruction decoding switch..case
statement:
IH
is a shorthand for Instruction Handler. It breaks from the previous case value (to stop fallthrough - we don’t want it anywhere in the decoding switch..case
statement) and starts a new case label.
Memory⌗
Let’s implement a few basic functions for interfacing with memory.
R8
is a monadic function taking the address (a u16
) and returning a byte (u8
) located at that address. R16
does exactly the same, but it reads a little-endian word instead. So, we read the u8
, but we also take the byte that follows it (A+1
) and shift it a byte left, to make a word.
There’s a known 6502 bug that has been patched in 65C02. The low byte is sometimes wrapped without incrementing the high byte. The implementation isn’t much different from R16
.
Writing to memory is fairly simple. We won’t need W16
in our emulator, so we implement just W8:
Next up, stack operations. We need to support POP8
, PSH8
, PSH16
and POP16
. Let’s do that:
On 6502, the stack grows downward, so pushing an 8-bit value is just writing a byte to a location pointed by sp
accounting the stack offset and decrementing the stack pointer, while popping an 8-bit value is just incrementing the sp
, adjusting for the stack offset, and reading a value.
Pushing and popping 16-bit numbers is a bit more complex. In PSH16
, we first take the value of stack pointer and fix it with the stack offset. Then, we push the higher bits (extracted using A>>8
) and the lower bits (extracted using A&0xFF
). This operation changes the stack pointer by two (the size of a word). POP16
is fairly straightforward too - popping a value and masking it with another popped value, which will constitute the higher bits of newly created word (A<<8
).
Moving swiftly on, let’s take on the implementation of addressing modes, starting with the zero page:
We read an immediate which is the address within the zero page, then dereference it. ZPX
and ZPY
account for X and Y-wise addressing. Absolute addressing requires reading a word. X and Y-wise addressing looks exactly the same in this addressing mode. Because absolute addressing itself isn’t used all that often, we won’t implement it as a function.
Relative addressing requires the use of signed arithmetic. Because not all architectures support it, it’d be smart to leave a comment in the source code explaining this.
Finishing the addressing saga, we implement the immediate addressing:
Flags⌗
6502 has a set of 7 flags. They are generally represented as a single byte, but the 5th flag is left unused (fixed to 1
).
Flag | Description |
---|---|
NF | Negative flag |
VF | Overflow flag |
** | Unused, fixed to 1 |
BF | Break flag |
DF | Decimal flag (BCD) |
IDF | Interrupt (IRQ) disable flag |
ZF | Zero flag |
CF | Carry flag |
A lot of operations will alter the zero and negative flag (ZF
and NF
), so let’s implement a function that adjusts them based on an operation result:
Since 6502 has instructions that push and pop the processor’s flags, we need a way to serialize and deserialize flags into a single byte:
Interrupts⌗
We talked for a brief moment about interrupts, but we didn’t quite explain them in enough detail. Some code will aid us:
So, nmi
, res
and irq
are a part of our “public” API. nmi
and res
clear the bf
flag and call INTR
with the corresponding vector, meanwhile irq
does it only if idf
(the interrupt/IRQ disable flag) is not set.
The INTR
function is a bit more interesting. We can notice that it:
- pushes the current instruction pointer on the stack
- pushes the flags on the stack
- loads the new instruction pointer from memory at
A
(the argument to INTR, usually it’s the value ofNMIV
,IRQV
, etc…) - disables IRQ/interrupt handling (because we’re already inside an interrupt!)
Arithmetics & bit operations⌗
You probably remember the decimal flag we talked about earlier. BCD (Binary Coded Decimal) numbers can be directly added or subracted on the 6502 by using decimal mode. The basics of decimal mode are generally fairly simple, but the specifics of its operation and use aren’t always well documented or explained.
A byte has 256 possible states - $00
to $FF
. These values may represent any sort of data. The most common way of representing numbers is as a binary number (or more specifically, an unsigned binary integer), where FF represents numbers 0 … 255. In BCD, a byte represents a number from 0 to 99, where the byte values represented in hexadecimal are treated as decimals. So, for example $39
is 39, $05
is 5, $99
is 99. The largest value such a decimal can take is thus $99
and it can’t contain hexadecimal digits (ABCDEF
). In other words, the upper digit (0 to 9) of the BCD number is stored in the upper 4 bits of the byte, and the lower digit is stored in the lower 4 bits. These 100 values are called valid BCD numbers. The other 156 possible values of a byte (i.e. where either or both hex digits are A to F) are called invalid BCD numbers. By contrast, all 256 possible values of a byte are valid binary numbers.
When $28
is a BCD number it represents 28 (in decimal), and when $28
is a binary number it represents 40 (in decimal).
The term BCD arithmetic means arthmetic that involves BCD numbers, whereas the term binary arithmetic means arithmetic that involves binary numbers. The same naming convention applies to addition and subtraction. For example, $19 + $28 = $47
using BCD addition (df = 1
), but $19 + $28 = $41
(i.e. 25 + 40 = 65) using binary addition.
6502 code usually assumes that the processor in is binary mode (df = 0
) upon entry, and these programs will often return in binary mode (usually because they don’t affect df
). In this case, BCD calculations are performed by starting with a SED
instruction, followed by the instructions to perform calculation itself, and finishing with an CLD
instruction.
Let’s look at the ADC
instruction implementation now:
Big and scary, but we can manage. First, let’s notice that this implementation basically boils down to:
V
is the other operand for addition (read from memory). So, because the binary implementation is simpler, we will now zoom on it:
First, we calculate the addition result - accumulator, the 2nd operand and the carry. Then, we adjust zf
and nf
. The rest of the code sets flags approperiately. (~(a^V)&(a^r)&0x80)
detects if the addition has overflowed, r&0xFF00
extracts the carry (everything that doesn’t fit to a byte, which could be also rewritten as r>0xFF
) and a=r&0xFF
sets the addition result.
Wait a second. How does this obscure carry formula work? Let the carry out of the full adder adding the least significant bit be called . Then, the carry out of the full adder adding the next least significant bit is . Thus, the carry out of the full adder adding the most significant bits is . This assumes that we are adding two bit numbers. We can write the formula as . This is effectively XORing the carry-in and the carry-out of the leftmost full adder. Why does this work? The XOR of the carry-in and carry-out differ if there’s either a 1 being carried in, and a 0 being carried out, or if there’s a 0 being carried in, and a 1 being carried out.
If you thought that this is hard, the BCD part is even worse. There are a dozen of sequences for performing this operation on binary-coded decimals:
Sequence 1:
AL = (A & $0F) + (B & $0F) + C
- If
AL >= $0A
, thenAL = ((AL + $06) & $0F) + $10
A = (A & $F0) + (B & $F0) + AL
A
can be>= $100
at this point- If
A >= $A0
, thenA = A + $60
- The accumulator result is the lower 8 bits of
A
- The carry result is 1 if
A >= $100
, and 0 ifA < $100
Sequence 2:
AL = (A & $0F) + (B & $0F) + C
- If
AL >= $0A
, thenAL = ((AL + $06) & $0F) + $10
A = (A & $F0) + (B & $F0) + AL
, using signed (twos complement) arithmetic- The
N
flag result is 1 if bit 7 ofA
is 1, and 0 if bit 7 ifA
is 0 - The
V
flag result is 1 ifA < -128
orA > 127
, and 0 if-128 <= A <= 127
On 6502, sequence 1 is used to compute A
and C
, while sequence 2 is used to compute N
and V
. Z
is based on the BCD accumulator result. For instance, 99 + 1
in decimal mode yields the accumulator to be equal to $00
:
SED
CLC
LDA #$99
ADC #$01
… and this is the algorithm the bit of code above implements. Moving on, to SBC
:
As most of it is isomorphic to the description before, I’ll skip the binary part and describe just the decimal case of this algorithm:
AL = (A & $0F) - (B & $0F) + C-1
- If
AL < 0
, thenAL = ((AL - $06) & $0F) - $10
A = (A & $F0) - (B & $F0) + AL
- If
A < 0
, thenA = A - $60
- The accumulator result is the lower 8 bits of
A
On 6502, this algorithm is used to compute the value of A
. C
, V
, N
and Z
flags are computed the same way as in the binary mode.
Time for something more straightforward now. Let’s implement increment and decrement:
Nothing to see here, just +1 / -1
and adjusting the Z
and N
flags. A bunch of other operations:
Branching⌗
This will be fairly simple. Let’s start with a conditional jump that jumps if the flag provided is set:
We also want an unconditional jump:
A thing that I haven’t explained earlier (when we were discussing interrupts) is “how to return from an interrupt?” The answer is fairly simple:
First, we pop off the flags, then we pop off the instruction pointer. So we’re doing the exact reverse of what happened in the INTR
function. Speaking of returning, we can implement JSR
and RTS
now:
… and CMP
:
The CMP
instruction is used for comparisons as the mnemonics suggests. The way it works is simple - they just perform a subtraction. Strictly speaking, CMP A
is similar to SEC / SBC N
. Both of these variants affect the N
, Z
, and C
flags in exactly the same way. However, unlike SBC
:
CMP
is not affected by df- The accumulator is not affected by the operation
- the V flag is not affected
A useful property of CMP is that it performs an equality comparison and an unsigned comparison. After CMP, the Z
flag contains the equality comparison result and the C
flag contains the unsigned comparison result, specifically:
Z = 0 => A != N
,BNE
will branch.Z = 1 => A == N
.BEQ
will branch.C = 0 => A < N
,BCC
will branch.C = 1 => A >= N
,BCS
will branch.
Additional operand combinations⌗
Many instructions take additional operand combinations:
…according to this table:
mnemonic | register operand | memory operand |
---|---|---|
INC |
X | X |
DEC |
X | X |
ROL |
X | |
ROR |
X | |
ASL |
X | |
LSR |
X |
State initialisation and step function⌗
We’re nearly done! Everything left now is initialising the processor state and implementing a step function. Let’s do that now:
The initialisation functions gives some default value to SP
(just in case the program doesn’t initialise it for any reason), clears all the flags, the instruction pointer, address registers and the accumulator.
Finally, the code for the stepping function follows:
There are many sources for 6502 instruction encodings, but I like this one. Maybe you’ll like it too.
Testing our emulator⌗
To test our emulator, let’s tack some code on the bottom:
This small routine will read a ROM image to memory and put it on a specified offset.
Now, we load a testing memory image (available in the source code repository mentioned down below alongside a disassembly and a bunch of other files) at #$4000
. An infinite loop calls step();
until the memory pointer reaches #$45C0
, in which case, if the byte under #$0210
is #$FF
, the test has passed. Otherwise, it has failed.
A tiny bit of driver code will run this example program:
… and, as expected, the tests passed!~
0 [20:23] ~/workspace/6502-emu % gcc 6502.c -o 6502
0 [20:23] ~/workspace/6502-emu % ./6502
passed
Klaus Dormann’s testing suite⌗
This one will be a bit tricky, because Klaus’ testing suite makes some assumption about our very own homebrew 6502 computer that aren’t true for our model. But we can easily fix that.
Let’s start off with changing the “tweakable” defaults:
… and then create a new testing routine:
The program’s expected entry point lies at #$0400
. If the program ends up in an infinite loop at any point, we assume that the test either passed or failed (which is why we keep track of prev_pc
; if something goes really wrong, we also have a check for the test taking too many cycles - ins
and the if
statement near the end guards against that).
If the program loops at pc=#$3469
, we assume it succeeded. Let’s try it now:
… and it works! This means that our 6502 emulator is a perfect replica in terms of functionality of a genuine 6502. Let’s look at our checklist:
Let’s set the goals first. With our emulator, we will want to achieve the following:
- small, portable code (check)
- decimal mode implemented (check)
- 100% coverage of legal opcodes (check)
- emulating hardware glitches (check) What isn’t the goal:
- cycle accurancy
- performance
Making a disassembler⌗
When I was digging through 6502 testing suites, I found a small binary blob that had a description attached to it (it explained what kind of hardware does it expect). I really wanted to know what’s exactly inside of it, and to do that I’d have to disassemble the hex code.
I thought it’d be a nice project to make my own 6502 disassembler to go with the emulator. So, let’s do that:
I copied a bunch of instruction tables from the internet. Now, we can dive into the actual disassembler:
First, we determine the size of an instruction and we load it into our variables.
… then, we display the hexadecimal dump of said instruction alongside its address.
Since the addressing modes on 6502 are really simple, and this is all it takes to disassemble a mnemonic, we can just switch over the addressing mode used by that particular opcode. Finally, we skip to a new line, go to the next instruction, and return the current address.
Now, we add a main function stub that runs disassemble
on itself as long as there’s input on stdin
:
… but, wait. Some programs, like the one above, required us to put them on a specific offset in memory. Our assembler should also support that. So let’s do that - it’s just a simple patch that comes before the while
loop:
There’s a small problem with this code - if I want my code on offset #$0400
, I don’t want to have to convert it to decimal. It’s just inconvenient. So, let’s replace atoi
with a call to our function that detects the base and converts the number:
Done! Now we can run it on this mysterious program:
… and it seems like it worked!