# Background

Back in January of 2021, I entered a small duel - the person to make the smallest SK calculus reduction machine in brainfuck before the deadline wins. As the author of asm2bf, I feel fairly confident working with it, so I thought that I could write a simple SK calculus reductor, and then optimise it for size using various tricks up my sleeve.

# Let’s dive into the code

The original source code for the SK calculus evaluator is now listed as an asm2bf example, so you can browse it in the asm2bf repository. In this blog post, I’ll try to just dissect it.

; SK calculus evaluator.
; Doesn't garbage collect, has very limited recursion depth.
; you can tweak this limit by changing the stk parameter below.
; Copyright (C) Kamila Szewczyk, 2021.

[bits 8]

stk 38
org 0

&heap
db 2

&changed
db 0


The beginning of the code is fairly self-explanatory. We’re limited to 8-bit brainfuck, so we make sure that asm2bf guards us against overflowing something known at the compile-time. Because the evaluator is recursive, it requires some stack space. The “default” version ships with 38 cells of stack space. We assume that the heap starts from the second cell of memory, and we declare a special global variable changed which faciliates putting the single-combinator reducing function in a fixpoint.

#call("read")
@loop
#call("eval")
rcl r2, *changed
vxcall sto *changed, 0
jnz r2, %loop


To start things off - main program loop. We read the SK calculus term, then reduce it, reset the changed variable and stop the reduction process if the term wasn’t changed. Next up - a fairly smart optimisation and a recursive term printing function:

    push 0
@P
push r2
rcl r2, r1
sub r1, 2
jz r2, %PS
dec r2
jz r2, %PK
out .(
mov r2, r1
rcl r1, r2
#call("P")
inc r2
rcl r1, r2
#call("P")
out .)
pop r2
ret
@PS
out .S
pop r2
ret
@PK
out .K
pop r2
ret


P is a function. We don’t call it, since by the end of evaluation we have to print the resulting term and labels in asm2bf exhibit the fallthru behavior. The problem here is that the P function returns to the label which is expected to be pushed on the stack. As an optimisation, I decided to push 0 and let the procedure code execute, so that when it’s done, it jumps to the label with ID 0, which terminates the program.

Then, we have a simple bump allocator that allocates 3 cells of memory at a time and tags the allocated memory area with the value in r1:

@A
rcl f3, *heap
sto f3, r1
inc f3
ots f3, *heap
sub f3, 3
mov r1, f3
ret


In other words, @A allocates a single tree node which is used to represent the SKI calculus expression.

Next up, we have something that resembles the print function from earlier, except @read allocates and reads a single SK calculus term. The RS and RK cases take care of, respectively, the S and K calculus, while the later part takes care of binary tree nodes. From this function, it is easy to infer that the node structure in memory is [type] [node 1] [node 2], where [type] is either 0 (S), 1 (K) or 2 (binary tree node).

@read
push r2
in r2
ceq r2, .S
cjnz %RS
ceq r2, .K
cjnz %RK
mov r1, 2
#call("A")
mov r2, r1
sto r2, r1
inc r2
sto r2, r1
dec r2
mov r1, r2
in r2
pop r2
ret
@RS
mov r1, 0
#call("A")
pop r2
ret
@RK
mov r1, 1
#call("A")
pop r2
ret


Now, everything left is the evaluation function:

@eval
push r2
push r3
rcl r2, r1
jz r2, %skip
rcl r3, r2
jz r3, %skip
rcl r4, r3
cne r4, 1
cjnz %skip
inc r2
rcl r3, r2
mov r1, r3
vxcall sto *changed, 1
jmp %not_bi


The first case we handle is the K combinator - first we check if the tree structure is correct, then we check if the combinator is K, and finally, we return the first argument to it.

The somewhat more involved case is the S combinator - it requires allocating a few new nodes, and in general involves weird node shuffling action. We start with checking if the tree is suitable to perform the S combinator reduction:

@skip
rcl r2, r1
jz r2, %notS
rcl r3, r2
jz r3, %notS
rcl r4, r3
jz r4, %notS
rcl r2, r4
jnz r2, %notS


Then according to the (((Sx)y)z) => ((xz)(yz)) rule, we transform the tree allocating two new nodes for (xz) and (yz), and the final node to contain these two. Everything is doable within just registers.

    inc r1
rcl r6, r1
dec r1
rcl r2, r1
rcl r3, r2
inc r2
inc r3
rcl r5, r2
rcl r4, r3
mov r1, 2
#call("A")
mov r2, r1
mov r1, 2
#call("A")
sto r2, r1
sto r1, r4
inc r1
sto r1, r6
mov r1, 2
#call("A")
inc r2
sto r2, r1
dec r2
sto r1, r5
inc r1
sto r1, r6
mov r1, r2
vxcall sto *changed, 1
jmp %not_bi


Last but not least, we have to account for the case where we have to go deeper to reduce an expression, so we finish the reduction function with a double recursion on both binary tree nodes:

@notS
rcl r2, r1
cge r2, 2
cjz %not_bi
mov r3, r1
rcl r1, r3
#call("eval")
sto r3, r1
inc r3
rcl r1, r3
#call("eval")
sto r3, r1
dec r3
mov r1, r3


And at the end of our reduction procedure we embed a common return point which pops off the preserved registers and returns.

@not_bi
pop r3
pop r2
ret


# The code golf part - making asm2bf shine

There are two obvious problems with this code. First, the stack takes a lot of space, so sto, rcl & friends will emit many >s an <s to get to the taperam. Second, the registers are picked suboptimally (because the code was written by a human). Third, too much stuff is preserved and move semantics weren’t utilised at all. The generated brainfuck code is 36 kilobytes big - not bad, but not great either.

We start off with fixing the first problem by creating stub functions for sto and rcl. While at it, we also add the move ^ prefix to a few instructions where preserving isn’t required.

; SK calculus evaluator.
; Doesn't garbage collect, has very limited recursion depth.
; you can tweak this limit by changing the stk parameter below.
; Copyright (C) Kamila Szewczyk, 2021.

[bits 8]

stk 38
org 0

&heap
db 2

&changed

$( function def_inline(ins) print("@" .. ins .. "_i\n" .. ins .. " f2, f3\nret\n") end function inline(ins, arg1, arg2) print("mov f2, " .. arg1 .. "\nmov f3, " .. arg2 .. "\n") call(ins .. "_i") end function inline_r(ins, arg1, arg2) print("mov f3, " .. arg2 .. "\n") call(ins .. "_i") print("mov " .. arg1 .. ", f2") end ) #call("read") @loop #call("eval")$(inline_r("rcl", "r2", "*changed"))
mov f2, *changed
mov f3, 0
#call("sto_i")
^jnz r2, %loop
^push 0
@P
^push r2
$(inline_r("rcl", "r2", "r1")) sub r1, 2 jz r2, %PS dec r2 jz r2, %PK out .( ^mov r2, r1$(inline_r("rcl", "r1", "r2"))
#call("P")
inc r2
$(inline_r("rcl", "r1", "r2")) #call("P") out .) pop r2 ret @PS out .S pop r2 ret @PK out .K pop r2 ret @A mov f3, *heap #call("rcl_i") add f2, 2 ^mov f3, r1 #call("sto_i") inc f2 ots f2, *heap sub f2, 3 ^mov r1, f2 ret @read ^push r2 in r2 ceq r2, .S cjnz %RS ceq r2, .K cjnz %RK mov r1, 2 #call("A") mov r2, r1 #call("read")$(inline("sto", "r2", "r1"))
inc r2
$(inline("sto", "r2", "r1")) dec r2 ^mov r1, r2 in r2 pop r2 ret @RS mov r1, 0 #call("A") pop r2 ret @RK mov r1, 1 #call("A") pop r2 ret @eval ^push r2 ^push r3$(inline_r("rcl", "r2", "r1"))
jz r2, %skip
$(inline_r("rcl", "r3", "r2")) jz r3, %skip add r3, 2$(inline_r("rcl", "r4", "r3"))
cne r4, 1
cjnz %skip
inc r2
$(inline_r("rcl", "r3", "r2")) mov r1, r3 mov f2, *changed mov f3, 1 #call("sto_i") jmp %not_bi @skip$(inline_r("rcl", "r2", "r1"))
jz r2, %notS
$(inline_r("rcl", "r3", "r2")) jz r3, %notS$(inline_r("rcl", "r4", "r3"))
jz r4, %notS
$(inline_r("rcl", "r2", "r4")) jnz r2, %notS inc r1$(inline_r("rcl", "r6", "r1"))
dec r1
$(inline_r("rcl", "r2", "r1"))$(inline_r("rcl", "r3", "r2"))
inc r2
inc r3
$(inline_r("rcl", "r5", "r2"))$(inline_r("rcl", "r4", "r3"))
mov r1, 2
#call("A")
^mov r2, r1
mov r1, 2
#call("A")
$(inline("sto", "r2", "r1"))$(inline("sto", "r1", "r4"))
inc r1
$(inline("sto", "r1", "r6")) mov r1, 2 #call("A") inc r2$(inline("sto", "r2", "r1"))
dec r2
$(inline("sto", "r1", "r5")) inc r1$(inline("sto", "r1", "r6"))
mov r1, r2
mov f2, *changed
mov f3, 1
#call("sto_i")
jmp %not_bi
@notS
$(inline_r("rcl", "r2", "r1")) cge r2, 2 cjz %not_bi ^mov r3, r1$(inline_r("rcl", "r1", "r3"))
#call("eval")
$(inline("sto", "r3", "r1")) inc r3$(inline_r("rcl", "r1", "r3"))
#call("eval")
$(inline("sto", "r3", "r1")) dec r3 ^mov r1, r3 @not_bi pop r3 pop r2 ret #def_inline("sto") #def_inline("rcl")  The size certainly improved - from 36 KB, we drop down to 30.2 KB, although this result still isn’t satisfactory for me. For this reason, I decided to replace all registers with fictional names and define them at the top of my source unit:  ?AX=r1 ?BX=r2 ?CX=r3 ?DX=r4 ?EX=f5 ?FX=f6 ?GX=f2 ?HX=f3 stk 38 org 0 &heap db 2 &changed$(
function def_inline(ins)
print("@" .. ins .. "_i\n" .. ins .. " GX, HX\nret\n")
end

function inline(ins, arg1, arg2)
print("mov GX, " .. arg1 .. "\nmov HX, " .. arg2 .. "\n")
call(ins .. "_i")
end

function inline_r(ins, arg1, arg2)
print("mov HX, " .. arg2 .. "\n")
call(ins .. "_i")
print("mov " .. arg1 .. ", GX")
end
)

@loop
#call("eval")
$(inline_r("rcl", "BX", "*changed")) mov GX, *changed mov HX, 0 #call("sto_i") ^jnz BX, %loop ^push 0 @P ^push BX add AX, 2$(inline_r("rcl", "BX", "AX"))
sub AX, 2
jz BX, %PS
dec BX
jz BX, %PK
out .(
^mov BX, AX
$(inline_r("rcl", "AX", "BX")) #call("P") inc BX$(inline_r("rcl", "AX", "BX"))
#call("P")
out .)
pop BX
ret
@PS
out .S
pop BX
ret
@PK
out .K
pop BX
ret
@A
mov HX, *heap
#call("rcl_i")
^mov HX, AX
#call("sto_i")
inc GX
ots GX, *heap
sub GX, 3
^mov AX, GX
ret
^push BX
in BX
clr AX
ceq BX, .S
cjnz %RK
ceq BX, .K
cjnz %RK
#call("A")
mov BX, AX
$(inline("sto", "BX", "AX")) #call("read") inc BX$(inline("sto", "BX", "AX"))
dec BX
^mov AX, BX
in BX
pop BX
ret
@RK
#call("A")
pop BX
ret
@eval
^push CX
$(inline_r("rcl", "BX", "AX")) jz BX, %skip$(inline_r("rcl", "CX", "BX"))
jz CX, %skip
$(inline_r("rcl", "DX", "CX")) cne DX, 1 cjnz %skip inc BX$(inline_r("rcl", "CX", "BX"))
mov AX, CX
mov GX, *changed
mov HX, 1
#call("sto_i")
jmp %not_bi
@skip
$(inline_r("rcl", "BX", "AX")) jz BX, %notS$(inline_r("rcl", "CX", "BX"))
jz CX, %notS
$(inline_r("rcl", "DX", "CX")) jz DX, %notS add DX, 2$(inline_r("rcl", "BX", "DX"))
jnz BX, %notS
inc AX
$(inline_r("rcl", "FX", "AX")) dec AX$(inline_r("rcl", "BX", "AX"))
$(inline_r("rcl", "CX", "BX")) inc BX inc CX$(inline_r("rcl", "EX", "BX"))
$(inline_r("rcl", "DX", "CX")) mov AX, 2 #call("A") ^mov BX, AX mov AX, 2 #call("A")$(inline("sto", "BX", "AX"))
$(inline("sto", "AX", "DX")) inc AX$(inline("sto", "AX", "FX"))
mov AX, 2
#call("A")
inc BX
$(inline("sto", "BX", "AX")) dec BX$(inline("sto", "AX", "EX"))
inc AX
$(inline("sto", "AX", "FX")) mov AX, BX mov GX, *changed mov HX, 1 #call("sto_i") jmp %not_bi @notS$(inline_r("rcl", "BX", "AX"))
cge BX, 2
cjz %not_bi
^mov CX, AX
$(inline_r("rcl", "AX", "CX")) #call("eval")$(inline("sto", "CX", "AX"))
inc CX
$(inline_r("rcl", "AX", "CX")) #call("eval")$(inline("sto", "CX", "AX"))
dec CX
^mov AX, CX
@not_bi
pop CX
ret

#def_inline("sto")
#def_inline("rcl")


I’ve also noticed that eval doesn’t have to preserve the BX register, which lets me get rid of a push/pop pair. Either way, this preparation was conducted because I want to create write a tiny register allocator for asm2bf, which will find the most optimal register layout for our program given these constraints.

I decided to start off with a bit of code that generates us the register mappings given a permutation of numbers from 1 to 8:

 f←{
regs←'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'f2' 'f3'
maps←'AX' 'BX' 'CX' 'DX' 'EX' 'FX' 'GX' 'HX'
⍺,⍨'?'∘,¨maps,¨'='∘,¨regs[⍵]
}


Next up, I mounted a 64M ramdisk at /mnt/ramdisk and removed the ?...=... headers from my source code. Let’s write a function that will test a specific layout:

 try←{
r←'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'f2' 'f3'
m←'AX' 'BX' 'CX' 'DX' 'EX' 'FX' 'GX' 'HX'
_←(⊂⍺,⍨'?'∘,¨m,¨'='∘,¨r[⍵])⎕NPUT'/mnt/ramdisk/X.asm' 1
_←⎕SH'bfmake /mnt/ramdisk/X.asm'
≢⊃⊃⎕NGET'/mnt/ramdisk/X.b' 1
}


I have inlined the f function, beause it seemed relatively convenient. I could have used ⎕NSIZE instead of ⎕NGET, but it doesn’t yield a significant performance improvement and complicates the code a little, because ⎕NSIZE expects a number tied with a file, not a file name. Either way, now it’s time to import the pmat function from dfns and write a stub function that tests the entire permutation matrix from 1 to 8:

 run←{
str←⊃⎕NGET ⍵ 1
m←↓pmat 8
s←⍕≢m
⎕←∊'total: 's
m[(⊃⍋)((⍳≢m){
size←str try ⍵
0=500|⍺:{_←⎕←∊(⍕⍵)'/'s ⋄ size}⍺
size
}¨m)]
}


Finally, I ran the function on the input file and went to make myself a coffee and take a short break. When I came back, I was greeted with the following:

      run '/home/palaiologos/Desktop/workspace/asmbf/examples/tiny_sk.asm'
total: 40320
500/40320
1000/40320
[...]
39500/40320
40000/40320
┌───────────────┐
│3 2 5 6 7 8 1 4│
└───────────────┘


Brilliant! Now we know what is the best permutation! It appears that the size of our program decreased by a few kilobytes:

      str try 3 2 5 6 7 8 1 4
27469


The try function generated our new assembly listing in the ramdisk, so I snatched it from there:

?AX=r3
?BX=r2
?CX=r5
?DX=r6
?EX=f2
?FX=f3
?GX=r1
?HX=r4


Finally, after removing the redundant PUSH BX/POP BX part, the size of our program settles on 27312 bytes.

# Benchmarks

Each value is the run time in seconds for the following test suite:

test "(((SK)K)K)" "K"
test "((((S((S(KS))K))(K((SK)K)))(KK))(SS))" "K"
test "((((S(K(S((SK)K))))K)S)K)" "(KS)"
test "(((S((S(K((SK)K)))((S((S(K((SK)K)))((SK)K)))(K(K(K((SK)K)))))))(K((S(KK))((SK)K))))((S((S(K((SK)K)))((S(K((S(KS))(S(K((SK)K))))))((S(KK))((SK)K)))))(K((SK)K))))" "(K((SK)K))"
test "((((S(KS))((S(K(S(KS))))((S(K(S(KK))))((SK)K))))((SK)K))((SK)K))" "((S((S(KS))((S(KK))((SK)K))))((SK)K))"

 Interpreter tritium bfi Program Minimum Maximum Average Minimum Maximum Average tiny-sk, registers 0.218 0.238 0.221 7.702 7.896 7.801 tiny-sk, plain 0.229 0.255 0.241 7.614 7.799 7.646 sk 0.208 0.223 0.208 6.440 6.619 6.500

To no surprise, plain sk is faster than any tiny-sk, because of less indirection. Register-optimized version of tiny-sk seems to be faster on tritium, and a little slower on bfi, which was fairly unexpected.

# BFVM’s say

There’s a different way of approaching the problem. We could take sk (tiny_sk won’t work since it uses move semantics + it won’t be any faster because of the thunks it uses for sto/rcl) and compile it using BFVM and then clang to get a decent native binary.

The results oscillate around 0.027s and 0.082s, with 0.032s being the average value. We’ve beaten tritium!