Introduction

6502 is a 8-bit processor with a 16-bit address bus. The production launched in 1975, and the processor’s clock ran at around 1-3 MHz. I like 6502 and I think that it’s a notable processor & architecture for many reasons:

  • 6502 is simple. The base instruction set is just 56 instructions. The processor has a single accumulator, A, and two index registers - X and Y. It also has a stack pointer and an instruction pointer, but these are transparent to the programmer.
  • 6502 is cheap. The processor had just around 3'500 transistors. Consequently, when it was launched, 6502 was the least expensive microprocessor on the market. It initially sold for <~ 1/6 the cost of competing CPUs - for instance, Intel 8080.
  • 6502 is historically relevant. Its introduction caused rapid decreases in pricing across the processor market. Along with the Zilog Z80, it sparked a series of projects that resulted in the home computer revolution of the early 1980s.
  • 6502 is relevant even nowadays! Because of it’s simplicity, cost, and power consumption, CMOS form of 6502 (65C02 developed by WDC), the 6502 family is widely used in embedded systems, with estimated production volumes in the hundreds of millions. 6502 shares this fate with 8051.

But.. why are these still relevant? Isn’t 6502 and 8051 too old? Well. Because they’re cheap, because they’re no longer covered by Intel’s and MOS Technology patents, because what’s referred to today as “8051” or “6502” isn’t a replica of what Intel and MOS Technology made in the 1970s and 1980s, but rather an ecosystem of software-compatible derivatives: same instruction set, same CPU registers, same memory architecture, and the list goes on and on. The 6502 and 8051 design has been refined by many silicon vendors over the decades who sell 8051-based microcontroller products, and because of that, these old 8-bit designs still have usefulness and relevance in the today’s market. They’re used in industrial control systems (industry hates changing “working” solutions - we covered that earlier), high-performance(!) devices, low power devices (c8051f98x) and IP cores of FPGAs. Actually, Furby runs 6502, just like many MP3 players, mouses, keyboards and monitors. Everything.

A deep dive

Let’s set the goals first. With our emulator, we will want to achieve the following:

  • small, portable code.
  • decimal mode implemented.
  • 100% coverage of legal opcodes.
  • emulating hardware glitches.

What isn’t the goal:

  • cycle accurancy
  • performance

6502 supports four kinds of interrupts - the one raised by the BRK instruction, IRQ, NMI and a RESET interrupt. 6502 also has a stack. Because of the memory addressing limitations, only 65K of memory is needed. Knowing these, let’s start our emulator.

/* Memory size */
#define MEM 0x10000
/* Stack start address */
#define STK 0x200
/* BRK vector */
#define BRKV 0xFFF0
/* NMI vector */
#define NMIV 0xFFF4
/* Reset vector */
#define RESV 0xFFF8
/* IRQ vector */
#define IRQV 0xFFFB

The vectors below define where is the code handling some interrupt. We’ll keep these values, but if you want to build your own 6502-centered machine, you can tweak these to your liking.

Generally speaking, since 6502 programs do a lot of 8-bit loads, we will keep our memory as a pointer to a vector of 8-bit integers. The program counter (or instruction pointer) is 16-bit, and we have 4 another 8-bit variables - A, X, Y and SP. A sharp eye will notice that this limits the stack to 256 bytes, but this isn’t a huge problem.

#include <stdint.h>
#include <assert.h>

/* All the integer types we use. Abstain from using 32-bit data types.
   We use _Bool for flags (we don't pack them into a single byte for
   simplicity, even though it'd be worth it). The only place where signed
   8-bit arithmetic is needed is relative addressing. Modifying the
   code to use only unsigned arithmetic is left as an exercise
   to the reader :).  */
typedef uint8_t u8;
typedef uint16_t u16;
typedef int8_t i8;
typedef _Bool n1;

/* The memory array, instruction pointer, registers and flags. */
u8*m;u16 pc;u8 a,x,y,sp;
n1 cf,zf,idf,df,bf,vf,nf;

6502 can be considered a predecessor of RISC architectures. It’s well known for having just a single accumulator and a large amount of addressing modes. Let’s look at these first:

  • Accumulator: the operation works on the accumulator. For example - ASL A / ROL A / ROR A (/ representing newline). A lot of assemblers allow omitting the explicit A, so ASL / ROL / ROR would compile too.

  • Absolute: data is accessed using a constant, 16-bit address. For example - LDA $0B26 / STY $1FE7.

  • Absolute with X or Y: data is accessed using a constant, 16-bit value added to the X or Y register. For example - STA $3FFF, X stores the accumulator at $3FFF + X, and STA $3FFF, Y stores the accumulator at $3FFF + Y.

  • Immediate: the operand is taken from the byte following the opcode. For example - LDA #$FF

  • Implied: the instruction takes no arguments. Similar to the accumulator addressing mode. For example - RTS, RTI.

  • Indirect: data is accessed by dereferencing a constant pointer. For example - JMP ($B800) jumps to the location pointed by a word at $B800 (lower byte $B800, higher byte $B801).

  • Indirect with X: An 8-bit ZP address and the X register are added. If the addition overflows, the address wraps around within ZP. The resulting address is used as a pointer to the data being accessed. This makes the X register an index into a list of pointers. For example - LDA ($05, X).

  • Indirect with Y: An 8-bit address identifies a pointer. The value of the Y register is added to the address contained in the pointer. Effectively, the pointer is the base address and the Y register is an index past that base address. There’s a (subtle) difference between this and indirect with X addressing mode. For example - LDA ($05), Y.

  • Relative: An 8-bit signed value is added to the current instruction pointer. The branch target must be in the range of (-128, 127) bytes of the current value. For example - BEQ $0300.

  • ZP (zero page): An 8-bit address is used within the zero page (the area of memory from offset $00 addressable using a single byte). For example - ORA $F0.

  • ZP with X or Y: analogical to the absolute with X or Y addressing mode.

Now, since 6502 has quite a bit of valid opcodes, I decided that we’ll use a bunch of macros to make the code cleaner:

#define R return
#define ZFN(r,n) static inline r n(void)
#define MFN(r,n,a1) static inline r n(a1 A)
#define DFN(r,n,a1,a2) static inline r n(a1 A, a2 B)

And the explanations:

  • R is a shorthand for return.
  • ZFN is a zero argument function.
  • MFN is a monadic (one argument) function.
  • DFN is a dyadic (two argument) function.

When working with packed flags, we’ll need a macro to elegantly unpack n-th bit from a bit vector. We’ll call it SHF, a shorthand for SHift Flags.

#define SHF(x,b)x=(A>>b)&1

As you have seen above, the addressing modes are quite repetitive, so we will add macros that create register & memory addressing modes:

#define REGO(n)MFN(void,n ## R,u8*){*A=n(*A);}
#define MEMO(n)MFN(void,n ## A,u16){u8 v=R8(A);W8(A,n(v));}

REGO is a shorthand for register operand and MEMO is a shorthand for memory operand. We assume that both are monadic functions. Functions with register operands will have their name end with R, while functions with memory operands will have their name end with A.

The memory operand macro is using R8 and W8 functions - respectively, fetch (read) a byte (8-bit value) from memory and write a byte to memory.

Finally, we will define a macro that will be used in the instruction decoding switch..case statement:

#define IH(V) ;break;case V:

IH is a shorthand for Instruction Handler. It breaks from the previous case value (to stop fallthrough - we don’t want it anywhere in the decoding switch..case statement) and starts a new case label.

Memory

Let’s implement a few basic functions for interfacing with memory.

MFN(u8,R8,u16){R m[A];}
MFN(u16,R16,u16){R(m[A+1]<<8)|m[A];}

R8 is a monadic function taking the address (a u16) and returning a byte (u8) located at that address. R16 does exactly the same, but it reads a little-endian word instead. So, we read the u8, but we also take the byte that follows it (A+1) and shift it a byte left, to make a word.

There’s a known 6502 bug that has been patched in 65C02. The low byte is sometimes wrapped without incrementing the high byte. The implementation isn’t much different from R16.

MFN(u16,RB16,u16){R(m[(A&0xFF00)|((A+1)&0xFF)]<<8)|m[A];}

Writing to memory is fairly simple. We won’t need W16 in our emulator, so we implement just W8:

DFN(void,W8,u16,u8){m[A]=B;}

Next up, stack operations. We need to support POP8, PSH8, POP8 and POP16. Let’s do that:

MFN(void,PSH8,u8){W8(STK+sp--,A);} ZFN(u8,POP8){R R8(++sp+STK);}
MFN(void,PSH16,u16){u16 l=STK+sp;W8(l,A>>8);W8(l-1,A&0xFF);sp-=2;}
ZFN(u16,POP16){R POP8()|(POP8()<<8);}

On 6502, the stack grows downward, so pushing an 8-bit value is just writing a byte to a location pointed by sp accounting the stack offset and decrementing the stack pointer, while popping an 8-bit value is just incrementing the sp, adjusting for the stack offset, and reading a value.

Pushing and popping 16-bit numbers is a bit more complex. In PSH16, we first take the value of stack pointer and fix it with the stack offset. Then, we push the higher bits (extracted using A>>8) and the lower bits (extracted using A&0xFF). This operation changes the stack pointer by two (the size of a word). POP16 is fairly straightforward too - popping a value and masking it with another popped value, which will constitute the higher bits of newly created word (A<<8).

Moving swiftly on, let’s take on the implementation of addressing modes, starting with the zero page:

ZFN(u8,ZPG){R R8(pc++);}ZFN(u8,ZPX){R ZPG()+x;}ZFN(u8,ZPY){R ZPG()+y;}

We read an immediate which is the address within the zero page, then dereference it. ZPX and ZPY account for X and Y-wise addressing. Absolute addressing requires reading a word. X and Y-wise addressing looks exactly the same in this addressing mode. Because absolute addressing itself isn’t used all that often, we won’t implement it as a function.

ZFN(u16,ABS){u16 A=R16(pc);pc+=2;R A;}
ZFN(u16,ABX){R ABS()+x;}ZFN(u16,ABY){R ABS()+y;}

Relative addressing requires the use of signed arithmetic. Because not all architectures support it, it’d be smart to leave a comment in the source code explaining this.

ZFN(i8,REL){R(i8)R8(pc++);} // XXX: Relative addressing requires signed arithmetic.

Finishing the addressing saga, we implement the immediate addressing:

ZFN(u16,IMM){R pc++;}

Flags

6502 has a set of 7 flags. They are generally represented as a single byte, but the 5th flag is left unused (fixed to 1).

Flag Description
NF Negative flag
VF Overflow flag
** Unused, fixed to 1
BF Break flag
DF Decimal flag (BCD)
IDF Interrupt (IRQ) disable flag
ZF Zero flag
CF Carry flag

A lot of operations will alter the zero and negative flag (ZF and NF), so let’s implement a function that adjusts them based on an operation result:

MFN(void,ADJZN,u8){zf=A==0;nf=A>>7;}

Since 6502 has instructions that push and pop the processor’s flags, we need a way to serialize and deserialize flags into a single byte:

ZFN(u8,GFL){R nf<<7|vf<<6|32|bf<<4|df<<3|idf<<2|zf<<1|cf;}
MFN(void,SFL,u8){SHF(nf,7);SHF(vf,6);SHF(bf,4);SHF(df,3);SHF(idf,2);SHF(zf,1);SHF(cf,0);}

Interrupts

We talked for a brief moment about interrupts, but we didn’t quite explain them in enough detail. Some code will aid us:

MFN(void,INTR,u16){PSH16(pc);PSH8(GFL());pc=R16(A);idf=1;}
ZFN(void,nmi){bf=0;INTR(NMIV);}
ZFN(void,res){bf=0;INTR(RESV);}
ZFN(void,irq){if(idf==0){bf=0;INTR(IRQV);}}

So, nmi, res and irq are a part of our “public” API. nmi and res clear the bf flag and call INTR with the corresponding vector, meanwhile irq does it only if idf (the interrupt/IRQ disable flag) is not set.

The INTR function is a bit more interesting. We can notice that it:

  • pushes the current instruction pointer on the stack
  • pushes the flags on the stack
  • loads the new instruction pointer from memory at A (the argument to INTR, usually it’s the value of NMIV, IRQV, etc…)
  • disables IRQ/interrupt handling (because we’re already inside an interrupt!)

Arithmetics & bit operations

You probably remember the decimal flag we talked about earlier. BCD (Binary Coded Decimal) numbers can be directly added or subracted on the 6502 by using decimal mode. The basics of decimal mode are generally fairly simple, but the specifics of its operation and use aren’t always well documented or explained.

A byte has 256 possible states - $00 to $FF. These values may represent any sort of data. The most common way of representing numbers is as a binary number (or more specifically, an unsigned binary integer), where $00 … $FF represents numbers 0 … 255. In BCD, a byte represents a number from 0 to 99, where the byte values represented in hexadecimal are treated as decimals. So, for example $39 is 39, $05 is 5, $99 is 99. The largest value such a decimal can take is thus $99 and it can’t contain hexadecimal digits (ABCDEF). In other words, the upper digit (0 to 9) of the BCD number is stored in the upper 4 bits of the byte, and the lower digit is stored in the lower 4 bits. These 100 values are called valid BCD numbers. The other 156 possible values of a byte (i.e. where either or both hex digits are A to F) are called invalid BCD numbers. By contrast, all 256 possible values of a byte are valid binary numbers.

When $28 is a BCD number it represents 28 (in decimal), and when $28 is a binary number it represents 40 (in decimal).

The term BCD arithmetic means arthmetic that involves BCD numbers, whereas the term binary arithmetic means arithmetic that involves binary numbers. The same naming convention applies to addition and subtraction. For example, $19 + $28 = $47 using BCD addition (df = 1), but $19 + $28 = $41 (i.e. 25 + 40 = 65) using binary addition.

6502 code usually assumes that the processor in is binary mode (df = 0) upon entry, and these programs will often return in binary mode (usually because they don’t affect df). In this case, BCD calculations are performed by starting with a SED instruction, followed by the instructions to perform calculation itself, and finishing with an CLD instruction.

Let’s look at the ADC instruction implementation now:

MFN(void,ADC,u16){u8 V=R8(A),cy,l,h;if(df){cy=cf;l=(V&0xF)+(a&0xF)+cy;if(l>9)l+=6;
h=(V>>4)+(a>>4)+(l>0xF);vf=~(a^V)&(a^(h<<4))&0x80;nf=vf=zf=cf=0;if(h>9)h+=6;
if(h>0xF)cf=1;a=(h<<4)|(l&0xF);if(!a)zf=1;if(h&0x8)nf=1;}else{u16 r=a+V+cf;
ADJZN(r&0xFF);vf=(~(a^V)&(a^r)&0x80);cf=r&0xFF00;a=r&0xFF;}}

Big and scary, but we can manage. First, let’s notice that this implementation basically boils down to:

MFN(void,ADC,u16){u8 V=R8(A),cy,l,h;if(df){
   /* decimal implementation */
}else{
   /* binary implementation */
}}

V is the other operand for addition (read from memory). So, because the binary implementation is simpler, we will now zoom on it:

u16 r=a+V+cf;
ADJZN(r&0xFF);
vf=(~(a^V)&(a^r)&0x80);
cf=r&0xFF00;
a=r&0xFF;

First, we calculate the addition result - accumulator, the 2nd operand and the carry. Then, we adjust zf and nf. The rest of the code sets flags approperiately. (~(a^V)&(a^r)&0x80) detects if the addition has overflowed, r&0xFF00 extracts the carry (everything that doesn’t fit to a byte, which could be also rewritten as r>0xFF) and a=r&0xFF sets the addition result.

Wait a second. How does this obscure carry formula work? Let the carry out of the full adder adding the least significant bit be called $c_0$. Then, the carry out of the full adder adding the next least significant bit is $c_1$. Thus, the carry out of the full adder adding the most significant bits is $c_{k - 1}$. This assumes that we are adding two $k$ bit numbers. We can write the formula as $V = c_{k - 1}$. This is effectively XORing the carry-in and the carry-out of the leftmost full adder. Why does this work? The XOR of the carry-in and carry-out differ if there’s either a 1 being carried in, and a 0 being carried out, or if there’s a 0 being carried in, and a 1 being carried out.

If you thought that this is hard, the BCD part is even worse. There are a dozen of sequences for performing this operation on binary-coded decimals:

Sequence 1:

  • AL = (A & $0F) + (B & $0F) + C
  • If AL >= $0A, then AL = ((AL + $06) & $0F) + $10
  • A = (A & $F0) + (B & $F0) + AL
  • A can be >= $100 at this point
  • If A >= $A0, then A = A + $60
  • The accumulator result is the lower 8 bits of A
  • The carry result is 1 if A >= $100, and 0 if A < $100

Sequence 2:

  • AL = (A & $0F) + (B & $0F) + C
  • If AL >= $0A, then AL = ((AL + $06) & $0F) + $10
  • A = (A & $F0) + (B & $F0) + AL, using signed (twos complement) arithmetic
  • The N flag result is 1 if bit 7 of A is 1, and 0 if bit 7 if A is 0
  • The V flag result is 1 if A < -128 or A > 127, and 0 if -128 <= A <= 127

On 6502, sequence 1 is used to compute A and C, while sequence 2 is used to compute N and V. Z is based on the BCD accumulator result. For instance, 99 + 1 in decimal mode yields the accumulator to be equal to $00:

SED
CLC
LDA #$99
ADC #$01

… and this is the algorithm the bit of code above implements. Moving on, to SBC:

MFN(void,SBC,u16){u8 V=R8(A),l,h;u16 r;if(df){u8 cy=!cf;r=a-V-cy;l=(a&0xF)-(V&0xF)-cy;
if(l>>7)l-=6;h=(a>>4)-(V>>4)-(l>>7);nf=vf=zf=cf=0;if((a^V)&(a^r)&0x80)vf=1;cf=!(r&0xFF00);
if(h>>7)h-=6;a=(h<<4)|(l&0xF);if(!a)zf=1;if(h&0x8)nf=1;}else{r=a-V-!cf;ADJZN(r&0xFF);
vf=(a^V)&(a^r)&0x80;cf=!(r&0xFF00);a=r&0xFF;}}

As most of it is isomorphic to the description before, I’ll skip the binary part and describe just the decimal case of this algorithm:

  • AL = (A & $0F) - (B & $0F) + C-1
  • If AL < 0, then AL = ((AL - $06) & $0F) - $10
  • A = (A & $F0) - (B & $F0) + AL
  • If A < 0, then A = A - $60
  • The accumulator result is the lower 8 bits of A

On 6502, this algorithm is used to compute the value of A. C, V, N and Z flags are computed the same way as in the binary mode.

Time for something more straightforward now. Let’s implement increment and decrement:

MFN(u8,INC,u8){u8 r=A+1;ADJZN(r);R r;}
MFN(u8,DEC,u8){u8 r=A-1;ADJZN(r);R r;}

Nothing to see here, just +1 / -1 and adjusting the Z and N flags. A bunch of other operations:

/* shift left; adjust the carry flags, perform our
   general routine of fixing up Z/N flags. */
MFN(u8,ASL,u8){u8 r=A<<1;cf=A>>7;ADJZN(r);R r;}
/* shift right; adjust the carry, z and n flag.
   carry is set, because the least significant bit
   might have been shifted out. */
MFN(u8,LSR,u8){u8 r=A>>1;cf=A&1;ADJZN(r);R r;}

/* bitwise operations - and, xor, or */
MFN(void,AND,u16){a&=R8(A);ADJZN(a);}
MFN(void,EOR,u16){a^=R8(A);ADJZN(a);}
MFN(void,ORA,u16){a|=R8(A);ADJZN(a);}

/* set the Z flag as though the value in the address tested
   were ANDed with the accumulator. The N and V flags are
   set to match bits 7 and 6 respectively in the value
   stored at the tested address. */
MFN(void,BIT,u16){u8 v=R8(A);vf=(v>>6)&1;zf=(v&a)==0;nf=v>>7;}

/* rotate one bit right & left */
MFN(u8,ROR,u8){u8 r=A>>1;r|=cf<<7;cf=A&1;ADJZN(r);R r;}
MFN(u8,ROL,u8){u8 r=A<<1;r|=cf;cf=A>>7;ADJZN(r);R r;}

Branching

This will be fairly simple. Let’s start with a conditional jump that jumps if the flag provided is set:

DFN(void,J,i8,n1){if(B)pc+=A;}

We also want an unconditional jump:

MFN(void,U,u16){pc=A;}

A thing that I haven’t explained earlier (when we were discussing interrupts) is “how to return from an interrupt?" The answer is fairly simple:

ZFN(void,RTI){SFL(POP8());pc=POP16();}

First, we pop off the flags, then we pop off the instruction pointer. So we’re doing the exact reverse of what happened in the INTR function. Speaking of returning, we can implement JSR and RTS now:

MFN(void,JSR,u16){PSH16(pc-1);pc=A;}
ZFN(void,RTS){pc=POP16()+1;}

… and CMP:

DFN(void,CMP,u16,u8){u8 v=R8(A);u8 r=B-v;cf=B>=v;ADJZN(r);}

The CMP instruction is used for comparisons as the mnemonics suggests. The way it works is simple - they just perform a subtraction. Strictly speaking, CMP A is similar to SEC / SBC N. Both of these variants affect the N, Z, and C flags in exactly the same way. However, unlike SBC:

  • CMP is not affected by df
  • The accumulator is not affected by the operation
  • the V flag is not affected

A useful property of CMP is that it performs an equality comparison and an unsigned comparison. After CMP, the Z flag contains the equality comparison result and the C flag contains the unsigned comparison result, specifically:

  • Z = 0 => A != N, BNE will branch.
  • Z = 1 => A == N. BEQ will branch.
  • C = 0 => A < N, BCC will branch.
  • C = 1 => A >= N, BCS will branch.

Additional operand combinations

Many instructions take additional operand combinations:

MEMO(INC)REGO(INC)
MEMO(DEC)REGO(DEC)
MEMO(ASL)
MEMO(LSR)
MEMO(ROR)
MEMO(ROL)

…according to this table:

mnemonic register operand memory operand
INC X X
DEC X X
ROL X
ROR X
ASL X
LSR X

State initialisation and step function

We’re nearly done! Everything left now is initialising the processor state and implementing a step function. Let’s do that now:

ZFN(void,init){sp=0xFD;pc=a=x=y=zf=cf=idf=df=bf=vf=nf=0;}

The initialisation functions gives some default value to SP (just in case the program doesn’t initialise it for any reason), clears all the flags, the instruction pointer, address registers and the accumulator.

Finally, the code for the stepping function follows:

ZFN(void,step){switch(R8(pc++)){
    default:
    IH(0xA9)LDR(&a,IMM()) /* LDA IMM */ IH(0xA5)LDR(&a,ZPG()) /* LDA ZPG */
    IH(0xB5)LDR(&a,ZPX()) /* LDA ZPX */ IH(0xAD)LDR(&a,ABS()) /* LDA ABS */
    IH(0xBD)LDR(&a,ABX()) /* LDA ABX */ IH(0xB9)LDR(&a,ABY()) /* LDA ABY */
    IH(0xA1)LDR(&a,INX()) /* LDA INX */ IH(0xB1)LDR(&a,INY()) /* LDA INY */
    IH(0xA2)LDR(&x,IMM()) /* LDX IMM */ IH(0xA6)LDR(&x,ZPG()) /* LDX ZPG */
    IH(0xB6)LDR(&x,ZPY()) /* LDX ZPY */ IH(0xAE)LDR(&x,ABS()) /* LDX ABS */
    IH(0xBE)LDR(&x,ABY()) /* LDX ABY */ IH(0xA0)LDR(&y,IMM()) /* LDY IMM */
    IH(0xA4)LDR(&y,ZPG()) /* LDY ZPG */ IH(0xB4)LDR(&y,ZPX()) /* LDY ZPX */
    IH(0xAC)LDR(&y,ABS()) /* LDY ABS */ IH(0xBC)LDR(&y,ABX()) /* LDY ABX */
    IH(0x85)W8(ZPG(),a)   /* STA ZPG */ IH(0x95)W8(ZPX(),a)   /* STA ZPX */
    IH(0x8D)W8(ABS(),a)   /* STA ABS */ IH(0x9D)W8(ABX(),a)   /* STA ABX */
    IH(0x99)W8(ABY(),a)   /* STA ABY */ IH(0x81)W8(INX(),a)   /* STA INX */
    IH(0x91)W8(INY(),a)   /* STA INY */ IH(0x86)W8(ZPG(),x)   /* STX ZPG */
    IH(0x96)W8(ZPY(),x)   /* STX ZPY */ IH(0x8E)W8(ABS(),x)   /* STX ABS */
    IH(0x84)W8(ZPG(),y)   /* STY ZPG */ IH(0x94)W8(ZPX(),y)   /* STY ZPX */
    IH(0x8C)W8(ABS(),y)   /* STY ABS */
    IH(0xAA)x=a;ADJZN(x)  /* TAX */     IH(0xA8)y=a;ADJZN(y)  /* TAY */
    IH(0xBA)x=sp;ADJZN(x) /* TSX */     IH(0x8A)a=x;ADJZN(a)  /* TXA */
    IH(0x9A)sp=x          /* TXS */     IH(0x98)a=y;ADJZN(a)  /* TYA */
    IH(0x69)ADC(IMM())    /* ADC IMM */ IH(0x65)ADC(ZPG())    /* ADC ZPG */
    IH(0x75)ADC(ZPX())    /* ADC ZPX */ IH(0x6D)ADC(ABS())    /* ADC ABS */
    IH(0x7D)ADC(ABX())    /* ADC ABX */ IH(0x79)ADC(ABY())    /* ADC ABY */
    IH(0x61)ADC(INX())    /* ADC INX */ IH(0x71)ADC(INY())    /* ADC INY */
    IH(0xC6)DECA(ZPG())   /* DEC ZPG */ IH(0xD6)DECA(ZPX())   /* DEC ZPX */
    IH(0xCE)DECA(ABS())   /* DEC ABS */ IH(0xDE)DECA(ABX())   /* DEC ABX */
    IH(0xCA)DECR(&x)      /* DEX */     IH(0x88)DECR(&y)      /* DEY */
    IH(0xE6)INCA(ZPG())   /* INC ZPG */ IH(0xF6)INCA(ZPX())   /* INC ZPX */
    IH(0xEE)INCA(ABS())   /* INC ABS */ IH(0xFE)INCA(ABX())   /* INC ABX */
    IH(0xE8)INCR(&x)      /* INX */     IH(0xC8)INCR(&y)      /* INY */
    IH(0xE9)SBC(IMM())    /* SBC IMM */ IH(0xE5)SBC(ZPG())    /* SBC ZPG */
    IH(0xF5)SBC(ZPX())    /* SBC ZPX */ IH(0xED)SBC(ABS())    /* SBC ABS */
    IH(0xFD)SBC(ABX())    /* SBC ABX */ IH(0xF9)SBC(ABY())    /* SBC ABY */
    IH(0xE1)SBC(INX())    /* SBC INX */ IH(0xF1)SBC(INY())    /* SBC INY */
    IH(0x29)AND(IMM())    /* AND IMM */ IH(0x25)AND(ZPG())    /* AND ZPG */
    IH(0x35)AND(ZPX())    /* AND ZPX */ IH(0x2D)AND(ABS())    /* AND ABS */
    IH(0x3D)AND(ABX())    /* AND ABX */ IH(0x39)AND(ABY())    /* AND ABY */
    IH(0x21)AND(INX())    /* AND INX */ IH(0x31)AND(INY())    /* AND INY */
    IH(0x0A)a=ASL(a)      /* ASL ACC */ IH(0x06)ASLA(ZPG())   /* ASL ZPG */
    IH(0x16)ASLA(ZPX())   /* ASL ZPX */ IH(0x0E)ASLA(ABS())   /* ASL ABS */
    IH(0x1E)ASLA(ABX())   /* ASL ABX */
    IH(0x24)BIT(ZPG())    /* BIT ZPG */ IH(0x2C)BIT(ABS())    /* BIT ABS */
    IH(0x49)EOR(IMM())    /* EOR IMM */ IH(0x45)EOR(ZPG())    /* EOR ZPG */
    IH(0x55)EOR(ZPX())    /* EOR ZPX */ IH(0x4D)EOR(ABS())    /* EOR ABS */
    IH(0x5D)EOR(ABX())    /* EOR ABX */ IH(0x59)EOR(ABY())    /* EOR ABY */
    IH(0x41)EOR(INX())    /* EOR INX */ IH(0x51)EOR(INY())    /* EOR INY */
    IH(0x4A)a=LSR(a)      /* LSR ACC */ IH(0x46)LSRA(ZPG())   /* LSR ZPG */
    IH(0x56)LSRA(ZPX())   /* LSR ZPX */ IH(0x4E)LSRA(ABS())   /* LSR ABS */
    IH(0x5E)LSRA(ABX())   /* LSR ABX */ IH(0x09)ORA(IMM())    /* ORA IMM */
    IH(0x05)ORA(ZPG())    /* ORA ZPG */ IH(0x15)ORA(ZPX())    /* ORA ZPX */
    IH(0x0D)ORA(ABS())    /* ORA ABS */ IH(0x1D)ORA(ABX())    /* ORA ABX */
    IH(0x19)ORA(ABY())    /* ORA ABY */ IH(0x01)ORA(INX())    /* ORA IMM */
    IH(0x11)ORA(INY())    /* ORA IMM */ IH(0x2A)a=ROL(a)      /* ROL ACC */
    IH(0x26)ROLA(ZPG())   /* ROL ZPG */ IH(0x36)ROLA(ZPX())   /* ROL ZPX */
    IH(0x2E)ROLA(ABS())   /* ROL ABS */ IH(0x3E)ROLA(ABX())   /* ROL ABX */
    IH(0x6A)a=ROR(a)      /* ROR ACC */ IH(0x66)RORA(ZPG())   /* ROR ZPG */
    IH(0x76)RORA(ZPX())   /* ROR ZPX */ IH(0x6E)RORA(ABS())   /* ROR ABS */
    IH(0x7E)RORA(ABX())   /* ROR ABX */ IH(0x90)J(REL(),!cf)  /* BCC REL */
    IH(0xB0)J(REL(),cf)   /* BCS REL */ IH(0xD0)J(REL(),!zf)  /* BNE REL */
    IH(0xF0)J(REL(),zf)   /* BEQ REL */ IH(0x10)J(REL(),!nf)  /* BPL REL */
    IH(0x30)J(REL(),nf)   /* BMI REL */ IH(0x50)J(REL(),!vf)  /* BVC REL */
    IH(0x70)J(REL(),vf)   /* BVS REL */ IH(0x4C)U(ABS())      /* JMP */
    IH(0x6C)U(RB16(ABS()))/* JMP */     IH(0x20)JSR(ABS())    /* JSR */
    IH(0x40)RTI()         /* RTI */     IH(0x60)RTS()         /* RTS */
    IH(0x38)cf=1          /* SEC */     IH(0x18)cf=0          /* CLC */
    IH(0xF8)df=1          /* SED */     IH(0xD8)df=0          /* CLD */
    IH(0x78)idf=1         /* SEI */     IH(0x58)idf=0         /* CLI */
    IH(0xB8)vf=0          /* CLV */
    IH(0xC9)CMP(IMM(),a)  /* CMP IMM */ IH(0xC5)CMP(ZPG(),a)  /* CMP ZPG */
    IH(0xD5)CMP(ZPX(),a)  /* CMP ZPX */ IH(0xCD)CMP(ABS(),a)  /* CMP ABS */
    IH(0xDD)CMP(ABX(),a)  /* CMP ABX */ IH(0xD9)CMP(ABY(),a)  /* CMP ABY */
    IH(0xC1)CMP(INX(),a)  /* CMP INX */ IH(0xD1)CMP(INY(),a)  /* CMP INY */
    IH(0xE0)CMP(IMM(),x)  /* CPX IMM */ IH(0xE4)CMP(ZPG(),x)  /* CPX ZPG */
    IH(0xEC)CMP(ABS(),x)  /* CPX ABS */ IH(0xC0)CMP(IMM(),y)  /* CPY IMM */
    IH(0xC4)CMP(ZPG(),y)  /* CPY ZPG */ IH(0xCC)CMP(ABS(),y)  /* CPY ABS */

    IH(0x48)PSH8(a)       /* PHA */     IH(0x68)a=POP8();ADJZN(a) /* PLA */
    IH(0x08)bf=1;PSH8(GFL()) /* PHP */  IH(0x28)SFL(POP8())   /* PLP */
    IH(0x00)bf=1;pc+=1;INTR(BRKV) /* BRK */
    IH(0xEA)break; /* NOP */
    }
}

There are many sources for 6502 instruction encodings, but I like this one. Maybe you’ll like it too.

Testing our emulator

To test our emulator, let’s tack some code on the bottom:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t u32;

DFN(void, loadmem, const char *, u16) {
    u32 size;
    FILE * f = fopen(A, "rb"); assert(f);
    fseek(f, 0, SEEK_END); size = ftell(f); rewind(f);
    assert(size + B <= MEM);
    assert(fread(&m[B], sizeof(u8), size, f) == size);
    fclose(f);
}

This small routine will read a ROM image to memory and put it on a specified offset.

ZFN(void, test1) {
    memset(m, 0, MEM);
    loadmem("test1.bin", 0x4000);
    init(); res();

    for(;;) {
        step();

        if (pc == 0x45C0) {
            printf("%s\n", m[0x0210] == 0xFF ? "passed" : "failed");
            return;
        }
    }
}

Now, we load a testing memory image (available in the source code repository mentioned down below alongside a disassembly and a bunch of other files) at #$4000. An infinite loop calls step(); until the memory pointer reaches #$45C0, in which case, if the byte under #$0210 is #$FF, the test has passed. Otherwise, it has failed.

A tiny bit of driver code will run this example program:

int main(void) {
    m = malloc(MEM);
    test1();
}

… and, as expected, the tests passed!~

 0 [20:23] ~/workspace/6502-emu % gcc 6502.c -o 6502
 0 [20:23] ~/workspace/6502-emu % ./6502
passed

Klaus Dormann’s testing suite

This one will be a bit tricky, because Klaus' testing suite makes some assumption about our very own homebrew 6502 computer that aren’t true for our model. But we can easily fix that.

Let’s start off with changing the “tweakable” defaults:

// Stack start address
#define STK 0x100
// BRK vector
#define BRKV 0xFFFE
// NMI vector
#define NMIV 0xFFFA
// Reset vector
#define RESV 0xFFFC
// IRQ vector
#define IRQV 0xFFFE

… and then create a new testing routine:

ZFN(void, test2) {
    memset(m, 0, MEM);
    loadmem("6502_functional_test.bin", 0);
    init(); pc = 0x400;
    u16 prev_pc = 0, ins = 0;

    for(;;) {
        step();

        if (prev_pc == pc) {
            if(pc == 0x3469) {
              printf("passed");
              break;
            }
            printf("failed (trap at $%04X)", pc);
            break;
        }
        prev_pc = pc;
        ins++;
        if(ins >= 100000) {
          printf("test took too long, broke at $%04X", pc);
          break;
        }
    }
}

The program’s expected entry point lies at #$0400. If the program ends up in an infinite loop at any point, we assume that the test either passed or failed (which is why we keep track of prev_pc; if something goes really wrong, we also have a check for the test taking too many cycles - ins and the if statement near the end guards against that).

If the program loops at pc=#$3469, we assume it succeeded. Let’s try it now:

 0 [20:40] ~/workspace/6502-emu % gcc main.c -o 6502
 0 [21:40] ~/workspace/6502-emu % ./6502
passed%

… and it works! This means that our 6502 emulator is a perfect replica in terms of functionality of a genuine 6502. Let’s look at our checklist:

Let’s set the goals first. With our emulator, we will want to achieve the following:

  • small, portable code (check)
  • decimal mode implemented (check)
  • 100% coverage of legal opcodes (check)
  • emulating hardware glitches (check) What isn’t the goal:
  • cycle accurancy
  • performance

Making a disassembler

When I was digging through 6502 testing suites, I found a small binary blob that had a description attached to it (it explained what kind of hardware does it expect). I really wanted to know what’s exactly inside of it, and to do that I’d have to disassemble the hex code.

I thought it’d be a nice project to make my own 6502 disassembler to go with the emulator. So, let’s do that:


#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Instruction tables. Copied from some online source.

enum { IMP, IMPA, MARK2, BRA, IMM, ZP, ZPX, ZPY, INDX, INDY, IND, MARK3, ABS, ABSX, ABSY, IND16, IND1X };
enum { I_ADC, I_AND, I_ASL, I_BCC, I_BCS, I_BEQ, I_BIT, I_BMI, I_BNE, I_BPL, I_BRA, I_BRK, I_BVC, I_BVS,
       I_CLC, I_CLD, I_CLI, I_CLV, I_CMP, I_CPX, I_CPY, I_DEC, I_DEX, I_DEY, I_EOR, I_INC, I_INX, I_INY,
       I_JMP, I_JSR, I_LDA, I_LDX, I_LDY, I_LSR, I_NOP, I_ORA, I_PHA, I_PHP, I_PHX, I_PHY, I_PLA, I_PLP,
       I_PLX, I_PLY, I_ROL, I_ROR, I_RTI, I_RTS, I_SBC, I_SEC, I_SED, I_SEI, I_STA, I_STP, I_STX, I_STY,
       I_STZ, I_TAX, I_TAY, I_TRB, I_TSB, I_TSX, I_TXA, I_TXS, I_TYA, I_WAI, I_XXX };

char ops[] =
      "ADCANDASLBCCBCSBEQBITBMIBNEBPLBRABRKBVCBVS"
      "CLCCLDCLICLVCMPCPXCPYDECDEXDEYEORINCINXINY"
      "JMPJSRLDALDXLDYLSRNOPORAPHAPHPPHXPHYPLAPLP"
      "PLXPLYROLRORRTIRTSSBCSECSEDSEISTASTPSTXSTY"
      "STZTAXTAYTRBTSBTSXTXATXSTYAWAI???";

char insname[256] = {
    I_BRK, I_ORA, I_XXX, I_XXX, I_TSB, I_ORA, I_ASL, I_XXX, I_PHP, I_ORA, I_ASL, I_XXX, I_TSB, I_ORA, I_ASL, I_XXX,
    I_BPL, I_ORA, I_ORA, I_XXX, I_TRB, I_ORA, I_ASL, I_XXX, I_CLC, I_ORA, I_INC, I_XXX, I_TRB, I_ORA, I_ASL, I_XXX,
    I_JSR, I_AND, I_XXX, I_XXX, I_BIT, I_AND, I_ROL, I_XXX, I_PLP, I_AND, I_ROL, I_XXX, I_BIT, I_AND, I_ROL, I_XXX,
    I_BMI, I_AND, I_AND, I_XXX, I_BIT, I_AND, I_ROL, I_XXX, I_SEC, I_AND, I_DEC, I_XXX, I_BIT, I_AND, I_ROL, I_XXX,
    I_RTI, I_EOR, I_XXX, I_XXX, I_XXX, I_EOR, I_LSR, I_XXX, I_PHA, I_EOR, I_LSR, I_XXX, I_JMP, I_EOR, I_LSR, I_XXX,
    I_BVC, I_EOR, I_EOR, I_XXX, I_XXX, I_EOR, I_LSR, I_XXX, I_CLI, I_EOR, I_PHY, I_XXX, I_XXX, I_EOR, I_LSR, I_XXX,
    I_RTS, I_ADC, I_XXX, I_XXX, I_STZ, I_ADC, I_ROR, I_XXX, I_PLA, I_ADC, I_ROR, I_XXX, I_JMP, I_ADC, I_ROR, I_XXX,
    I_BVS, I_ADC, I_ADC, I_XXX, I_STZ, I_ADC, I_ROR, I_XXX, I_SEI, I_ADC, I_PLY, I_XXX, I_JMP, I_ADC, I_ROR, I_XXX,
    I_BRA, I_STA, I_XXX, I_XXX, I_STY, I_STA, I_STX, I_XXX, I_DEY, I_BIT, I_TXA, I_XXX, I_STY, I_STA, I_STX, I_XXX,
    I_BCC, I_STA, I_STA, I_XXX, I_STY, I_STA, I_STX, I_XXX, I_TYA, I_STA, I_TXS, I_XXX, I_STZ, I_STA, I_STZ, I_XXX,
    I_LDY, I_LDA, I_LDX, I_XXX, I_LDY, I_LDA, I_LDX, I_XXX, I_TAY, I_LDA, I_TAX, I_XXX, I_LDY, I_LDA, I_LDX, I_XXX,
    I_BCS, I_LDA, I_LDA, I_XXX, I_LDY, I_LDA, I_LDX, I_XXX, I_CLV, I_LDA, I_TSX, I_XXX, I_LDY, I_LDA, I_LDX, I_XXX,
    I_CPY, I_CMP, I_XXX, I_XXX, I_CPY, I_CMP, I_DEC, I_XXX, I_INY, I_CMP, I_DEX, I_WAI, I_CPY, I_CMP, I_DEC, I_XXX,
    I_BNE, I_CMP, I_CMP, I_XXX, I_XXX, I_CMP, I_DEC, I_XXX, I_CLD, I_CMP, I_PHX, I_STP, I_XXX, I_CMP, I_DEC, I_XXX,
    I_CPX, I_SBC, I_XXX, I_XXX, I_CPX, I_SBC, I_INC, I_XXX, I_INX, I_SBC, I_NOP, I_XXX, I_CPX, I_SBC, I_INC, I_XXX,
    I_BEQ, I_SBC, I_SBC, I_XXX, I_XXX, I_SBC, I_INC, I_XXX, I_SED, I_SBC, I_PLX, I_XXX, I_XXX, I_SBC, I_INC, I_XXX
};

char addrmode[256] = {
    IMP, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMPA,  IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZP,   ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMPA,  IMP,  ABS,	  ABSX,	 ABSX, IMP,
    ABS, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMPA,  IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZPX,  ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMPA,  IMP,  ABSX,	  ABSX,	 ABSX, IMP,
    IMP, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMPA,  IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZP,   ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMP,   IMP,  ABS,	  ABSX,	 ABSX, IMP,
    IMP, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMPA,  IMP,  IND16,  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZPX,  ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMP,   IMP,  IND1X,  ABSX,	 ABSX, IMP,
    BRA, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMP,   IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZPX,  ZPX,   ZPY,	 IMP,	IMP,  ABSY,  IMP,   IMP,  ABS,	  ABSX,	 ABSX, IMP,
    IMM, INDX,  IMM, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMP,   IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZPX,  ZPX,   ZPY,	 IMP,	IMP,  ABSY,  IMP,   IMP,  ABSX,	  ABSX,	 ABSY, IMP,
    IMM, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMP,   IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZP,   ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMP,   IMP,  ABS,	  ABSX,	 ABSX, IMP,
    IMM, INDX,  IMP, IMP,  ZP,   ZP,	   ZP,	 IMP,	IMP,  IMM,   IMP,   IMP,  ABS,	  ABS,	 ABS,  IMP,
    BRA, INDY,  IND, IMP,  ZP,   ZPX,   ZPX,	 IMP,	IMP,  ABSY,  IMP,   IMP,  ABS,	  ABSX,	 ABSX, IMP
};

I copied a bunch of instruction tables from the internet. Now, we can dive into the actual disassembler:

uint16_t disassemble(uint16_t addr) {
    uint16_t temp, mode, ind;
    uint8_t p1, p2;
    uint8_t op = getchar();
    mode = addrmode[op];
    p1 = (mode > MARK2) ? getchar() : 0;
    p2 = (mode > MARK3) ? getchar() : 0;

First, we determine the size of an instruction and we load it into our variables.

    ind = insname[op] * 3;
    printf("%04X   ", addr);
    if(mode > MARK3)
        printf("%02X %02X %02X   ", op, p1, p2);
    else if(mode > MARK2)
        printf("%02X %02X      ", op, p1);
    else
        printf("%02X         ", op);
    for(temp = 0; temp < 3; temp++)
        putchar(ops[ind + temp]);
    putchar(' ');

… then, we display the hexadecimal dump of said instruction alongside its address.

    switch (mode) {
        case IMPA: putchar('A'); break;
        case BRA: temp = addr + 2 + (signed char) p1; printf("%04X", temp); addr++; break;
        case IMM: printf("#$%02X", p1); addr++; break;
        case ZP: printf("$%02X", p1); addr++; break;
        case ZPX: printf("$%02X,X", p1); addr++; break;
        case ZPY: printf("$%02X,Y", p1); addr++; break;
        case IND: printf("($%02X)", p1); addr++; break;
        case INDX: printf("($%02X,X)", p1); addr++; break;
        case INDY: printf("($%02X),Y", p1); addr++; break;
        case ABS: printf("$%02X%02X", p2, p1); addr += 2; break;
        case ABSX: printf("$%02X%02X,X", p2, p1); addr += 2; break;
        case ABSY: printf("$%02X%02X,Y", p2, p1); addr += 2; break;
        case IND16: printf("($%02X%02X)", p2, p1); addr += 2; break;
        case IND1X: printf("($%02X%02X,X)", p2, p1); addr += 2; break;
    }

Since the addressing modes on 6502 are really simple, and this is all it takes to disassemble a mnemonic, we can just switch over the addressing mode used by that particular opcode. Finally, we skip to a new line, go to the next instruction, and return the current address.

    putchar('\n');
    addr++;
    return addr;
}

Now, we add a main function stub that runs disassemble on itself as long as there’s input on stdin:

int main(int argc, char * argv[]) {
    uint16_t addr = 0;
    while(!feof(stdin))
        addr = disassemble(addr);
}

… but, wait. Some programs, like the one above, required us to put them on a specific offset in memory. Our assembler should also support that. So let’s do that - it’s just a simple patch that comes before the while loop:

    uint16_t addr = argc == 2 ? atoi(argv[1]) : 0;
    printf("                  * = $%04X\n", addr);

There’s a small problem with this code - if I want my code on offset #$0400, I don’t want to have to convert it to decimal. It’s just inconvenient. So, let’s replace atoi with a call to our function that detects the base and converts the number:

static int smart_cvt(char * number) {
    int base = 10;
    if(!strncmp(number, "0x", 2)) base = 16;
    else if(!strncmp(number, "0b", 2)) base = 2;
    else if(!strncmp(number, "0o", 2)) base = 8;
    return strtol((base != 10) * 2 + number, NULL, base);
}

int main(int argc, char * argv[]) {
    uint16_t addr = argc == 2 ? smart_cvt(argv[1]) : 0;
    printf("                  * = $%04X\n", addr);
    while(!feof(stdin))
        addr = disassemble(addr);
}

Done! Now we can run it on this mysterious program:


                  * = $4000
4000   A9 00      LDA #$00
4002   8D 10 02   STA $0210
4005   A9 55      LDA #$55
4007   8D 00 02   STA $0200
400A   A9 AA      LDA #$AA
400C   8D 01 02   STA $0201
400F   A9 FF      LDA #$FF
4011   8D 02 02   STA $0202
[...]
4070   91 60      STA ($60),Y
4072   A9 7E      LDA #$7E
4074   B1 60      LDA ($60),Y
4076   9D FF 07   STA $07FF,X
4079   A9 7E      LDA #$7E
407B   BD FF 07   LDA $07FF,X
407E   99 FF 07   STA $07FF,Y
4081   A9 7E      LDA #$7E
4083   B9 FF 07   LDA $07FF,Y
4086   81 36      STA ($36,X)
4088   A9 7E      LDA #$7E
408A   A1 36      LDA ($36,X)
408C   86 50      STX $50
[...]
42E5   A9 00      LDA #$00
42E7   60         RTS
42E8   95 0D      STA $0D,X
42EA   A5 40      LDA $40
42EC   CD 04 02   CMP $0204
42EF   F0 08      BEQ $42F9
42F1   A9 04      LDA #$04
42F3   8D 10 02   STA $0210
42F6   4C C0 45   JMP $45C0
42F9   A9 35      LDA #$35
42FB   AA         TAX
42FC   CA         DEX
42FD   CA         DEX
42FE   E8         INX
42FF   8A         TXA
4300   A8         TAY
4301   88         DEY
4302   88         DEY
4303   C8         INY
4304   98         TYA
4305   AA         TAX
4306   A9 20      LDA #$20
4308   9A         TXS
4309   A2 10      LDX #$10
430B   BA         TSX
430C   8A         TXA
430D   85 40      STA $40
430F   A5 40      LDA $40
4311   CD 05 02   CMP $0205
4314   F0 08      BEQ $431E
4316   A9 05      LDA #$05
4318   8D 10 02   STA $0210
431B   4C C0 45   JMP $45C0
431E   2A         ROL A

… and it seems like it worked!

For the persistent reader

The Github repository with all the code.