Introduction

In January of 2021, one of my Discord servers was visited by a person who demonstrated a non-alphanumeric snippet of code written in C.

Manna5: This C code gives us the highest number from a 2-element array _. Can you write more valid and functional C expressions without using alphanumeric characters?

Just being myself, initially I wanted to see if the non-alphanumeric subset of C is Turing-complete. The simplest Turing-complete device I could think of around that time was the Minsky machine, for which I made my own notation.

#define _ return
_$(_$$,_$_,__$,_$$$,_$$_){_ _$$_>_$$?__$:_$(_$$,__$,
_$_+__$,_$_+__$,++_$$_);} _$_$(_$$){_!_$$?('('-'('):
_$__(_$$,('-'-'-'),('+'-'*'),('<'-'<'),('_'-']'));}

– A relatively efficient fibonacci function in non-alphanumeric C.

Minsky Machine is, simply put, a computer capable of storing two infinitely large numbers in its memory. It can perform two operations - inc X (increment X-th number) and dec X, Y (decrement X-th number if it’s greater than zero, otherwise jump to instruction Y).

This example Minsky machine program loops infinitely:

1:    1 -1      // => increment r1
2:    1  1      // => decrement r1, if zero jump to line 1
3:    1  1      // => decrement r1, if zero jump to line 1

A Minsky machine program in my notation is a list of integer tuples. Thinking in terms of a single tuple (numbers x and y), which is equivalent to a single instruction:

  • If the y is equal to -1, then this instruction increments the register given by x.
  • If the y is not equal to -1, then this instruction decrements the register given by x if it’s greater than zero. Otherwise, the instruction pointer is altered to point the instruction given by y.

It’s possible to express this program in terms of C code:

int regs[2] = { 0 };

int main(void) {
    for(int ip = 1; ip; ip++)
        switch(ip) {
            case 1: regs[0]++; break;
            case 2: if(regs[0] > 0) regs[0]--; else ip = 1; break;
            case 3: if(regs[0] > 0) regs[0]--; else ip = 1; break;
        }
}

As an inquiring mind, you might be wondering how is this extremely simple set of instructions Turing-complete. The answer to this question is given by the textbook “The Theory of Parsing, Translation, and Compiling” from 1972, but to keep things simple, I won’t provide a proof right now.

Either way, having linked Minsky machines, turing completeness and non-alphanumeric C code in my head, I sketched up the following program:

#define ONE ('%'-'$')
#define ZERO (ONE-ONE)
#define TWO (ONE+ONE)
#define THREE (TWO+ONE)
#define FOUR (THREE+ONE)
#define FIVE (FOUR+ONE)
#define SIX (FIVE+ONE)
#define SEVEN (SIX+ONE)
#define EIGHT (SEVEN+ONE)
#define NINE (EIGHT+ONE)
#define IF_IP(x,y,z) ((__==(x))?(y):(z))
#define C1 ___
#define C2 ____
#define NOP __

__=ZERO,C1,C2;_(){
IF_IP(ONE, C1++, IF_IP(TWO, C1--, NOP))
;__++;_();}

Passing it through the C preprocessor yields exactly what I was looking for:

% gcc -E c1.c | perl -pe "s/^#.*\n$//g;"
__='$'-'$',___,____;_(){((__==(('%'-'$')))?(___++
):(((__==((('%'-'$')+('%'-'$'))))?(___--):(__))))
;__++;_();}

I define a bunch of numeric constants obtained by questionably smart arithmetical manipulation, two counters and construct a program that increments C1, decrements C1 and loops infinitely - but it’s a start. Having written this, I felt that translating Minsky machines to non-alphanumeric C might be trivial. In fact, all that needs to be done is the jumping behavior of decrementation, which can be accomplished by reassigning the instruction pointer, and the infinite looping problem can be solved by checking if the instruction pointer didn’t go out of bounds of the program. Having figured that all out and written a Perl script that automates the translation to some degree, I was happy that I played with this odd concept for a while and forgot about it for good. After all - I had something better to do back then.

"?remind 5h take a closer look at this" - (a while later) - “not now, i’m playing with my gf”

The fun part

I wrote a transpiler that targets non-alphanumeric C, which I explain further into the post - it contains a lot of technical terms, assumes knowledge of mathematics, APL, and nifty C tricks. For now though, let’s play with this little black box.

I start with a barebones fibonacci function:

int fib(int n) {
    if (n <= 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

First, I turn the entire function into a single expression for convenience:

int fib(int n) {
    return n <= 1 ? n : fib(n - 1) + fib(n - 2);
}

Then, I remove all the whitespace and replace the names with their uppercase counterparts (so they’re picked up by the transpiler).

int FIB(int N){return N<=1?N:FIB(n-1)+FIB(n-2);}

I can abuse the implicit int rule to get rid of the type declarations, but I have to force the compiler to emit a #define directive for return:

%define return
FIB(N){return N<=1?N:FIB(N-1)+FIB(N-2);}

And after invoking our compiler, I get the following magical nonsense:

#define _ return
_$(_$$){_ _$$<=('?'-'>')?_$$:_$(_$$-('*'-')'))+_$(_$$-('>'-'<'));}

There are a few obvious improvements that I could make (x(yz)=xy+zx - (y - z) = x - y + z and the precedence-founded parenthesis dropping):

#define _ return
_$(_$$){_ _$$<='?'-'>'?_$$:_$(_$$-'*'+')')+_$(_$$-'>'+'<');}

Isn’t it beautiful? And it works too! But… what about my initial goal? I still don’t have a concrete proof of Turing completeness and going the simple Minsky Machine route doesn’t sound excessively fun. Let’s implement Brainfuck in our C subset instead! I sketched the following, normal Brainfuck interpreter in C:

int match(char * code, int i, int dir) {
    int level = 1;
    do {
        i += dir;
        if (code[i] == '[')
            level += dir;
        else if (code[i] == ']')
            level -= dir;
    } while (level > 0);
    return i;
}

void runbf(int(*putchar)(int), int(*getchar)(void),
        char * code, int code_size) {
    int tape_buf[65536] = {0}; int * tape = tape_buf;
    for(int i = 0; i < code_size; i++) {
        if(code[i] == '+') ++*tape;
        else if(code[i] == '-') --*tape;
        else if(code[i] == '>') tape++;
        else if(code[i] == '<') tape--;
        else if(code[i] == '.') putchar(*tape);
        else if(code[i] == ',') *tape = getchar();
        else if(code[i] == '[') { if(!*tape) i = match(code, i, 1); }
        else if(code[i] == ']') { i = match(code, i, -1) - 1; }
    }
}

It’s fairly self-explanatory, so I’ll avoid getting into details. What’s more interesting is how to turn our code into the crazy, amazing, non-alphanumeric form.

I started off by turning a bunch of local variables to parameters and replacing some if statements. Then, I split the match function to avoid a do..while loop:

int match_loop(char * code, int i, int dir, int level) {
    while(level) {
        i += dir;
        level += dir * (code[i] == '[' ? 1 : code[i] == ']' ? -1 : 0);
    }
    return i;
}

int match(char * code, int i, int dir, int level) {
    i += dir;
    level += dir * (code[i] == '[' ? 1 : code[i] == ']' ? -1 : 0);
    return match_loop(code, i, dir, level);
}

void runbf(int(*putchar)(int), int(*getchar)(void),
        char * code, int code_size) {
    int tape_buf[65536] = {0}; int * tape = tape_buf;
    for(int i = 0; i < code_size; i++) {
        if(code[i] == '+') ++*tape;
        else if(code[i] == '-') --*tape;
        else if(code[i] == '>') tape++;
        else if(code[i] == '<') tape--;
        else if(code[i] == '.') putchar(*tape);
        else if(code[i] == ',') *tape = getchar();
        else if(code[i] == '[') i = !*tape ? match(code, i, 1, 1) : i;
        else if(code[i] == ']') i = match(code, i, -1, 1) - 1;
    }
}

From there, match is simple to turn into an expression. I also turned match_loop into a recursive function that doesn’t make use of the while construct. I rolled up the if statement in match_loop, and in the meantime, I transformed runbf using the same logic, getting rid of the for loop:

int match_loop(char * code, int i, int dir, int level) {
    return!level?i:match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

int match(char * code, int i, int dir, int level) {
    return match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

void runbf_2(int(*putchar)(int), int(*getchar)(void), char * code, int code_size, int * tape, int i) {
    while(i < code_size) {
        if(code[i] == '+') ++*tape;
        else if(code[i] == '-') --*tape;
        else if(code[i] == '>') tape++;
        else if(code[i] == '<') tape--;
        else if(code[i] == '.') putchar(*tape);
        else if(code[i] == ',') *tape = getchar();
        else if(code[i] == '[') i = !*tape ? match(code, i, 1, 1) : i;
        else if(code[i] == ']') i = match(code, i, -1, 1) - 1;
        i++;
    }
}

void runbf(int(*putchar)(int), int(*getchar)(void),
        char * code, int code_size) {
    int tape_buf[65536] = {0}; runbf_2(putchar, getchar, code, code_size, tape_buf, 0);
}

Now, I made runbf_2 recursive and collapsed if statements into ternaries.

int match_loop(char * code, int i, int dir, int level) {
    return!level?i:match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

int match(char * code, int i, int dir, int level) {
    return match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

void runbf_2(int(*putchar)(int), int(*getchar)(void), char * code, int code_size, int * tape, int i) {
    if(i < code_size) {
        *tape += code[i]=='+'?1:code[i]=='-'?-1:code[i]=='.'?(putchar(*tape),0):code[i]==','?(getchar()-*tape):0;
        tape += code[i]=='>'?1:code[i]=='<'?-1:0;
        runbf_2(putchar, getchar, code, code_size, tape, (code[i] == '[' ? (!*tape ? match(code, i, 1, 1) : i) : code[i] == ']' ? (match(code, i, -1, 1) - 1) : i) + 1);
    }
}

void runbf(int(*putchar)(int), int(*getchar)(void),
        char * code, int code_size) {
    int tape_buf[65536] = {0}; runbf_2(putchar, getchar, code, code_size, tape_buf, 0);
}

Nearly done! I added a dummy parameter to runbf_2 and relied on the evaluation order to set the *tape value, while casually calling runbf_2. The last step is collapsing the final if statement into a ternary.

int match_loop(char * code, int i, int dir, int level) {
    return!level?i:match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

int match(char * code, int i, int dir, int level) {
    return match_loop(code, i + dir, dir, level + (dir * (code[i + dir] == '[' ? 1 : code[i + dir] == ']' ? -1 : 0)));
}

int runbf_2(int dummy, int(*putchar)(int), int(*getchar)(void), char * code, int code_size, int * tape, int i) {
    return i < code_size ? runbf_2(*(tape += code[i]=='>'?1:code[i]=='<'?-1:0) += code[i]=='+'?1:code[i]=='-'?-1:code[i]=='.'?(putchar(*tape),0):code[i]==','?(getchar()-*tape):0, putchar, getchar, code, code_size, tape, (code[i] == '[' ? (!*tape ? match(code, i, 1, 1) : i) : code[i] == ']' ? (match(code, i, -1, 1) - 1) : i) + 1) : 0;
}

void runbf(int(*putchar)(int), int(*getchar)(void),
        char * code, int code_size) {
    int tape_buf[65536] = {0}; runbf_2(0, putchar, getchar, code, code_size, tape_buf, 0);
}

Finally, I renamed all the symbols and prepared the source code for compilation. I didn’t manage to get rid of return, char and int, so I defined these:

%define int
%define return
%define char
int A(char*D,int E,int F,int G){return!G?E:A(D,E+F,F,G+(F*(D[E+F]=='['?1:D[E+F]==']'?-1:0)));}
int B(char*D,int E,int F,int G){return A(D,E+F,F,G+(F*(D[E+F]=='['?1:D[E+F]==']'?-1:0)));}
int C(int F,int(*G)(int),int(*H)(),char*D,int I,int*J,int E){return E<I?C(*(J+=D[E]=='>'?1:D[E]=='<'?-1:0)+=D[E]=='+'?1:D[E]=='-'?-1:D[E]=='.'?(G(*J),0):D[E]==','?(H()-*J):0,G,H,D,I,J,(D[E]=='['?(!*J?B(D,E,1,1):E):D[E]==']'?(B(D,E,-1,1)-1):E)+1):0;}
int RUN_BF(int(*G)(int),int(*H)(),char*D,int I){int F[65536]={0};return C(0,G,H,D,I,F,0);}

After passing it through the compiler, that’s the output I got:

#define _ int
#define __ return
#define ___ char
_ _$(___*_$$,_ _$_,_ __$,_ _$$$){__!_$$$?_$_:_$(_$$,_$_+__$,__$,_$$$+(__$*(_$$[_$_+__$]=='['?('&'-'%'):_$$[_$_+__$]==']'?-('/'-'.'):('_'-'_'))));} _ _$$_(___*_$$,_ _$_,_ __$,_ _$$$){__ _$(_$$,_$_+__$,__$,_$$$+(__$*(_$$[_$_+__$]=='['?('^'-']'):_$$[_$_+__$]==']'?-('?'-'>'):('>'-'>'))));} _ _$_$(_ __$,_(*_$$$)(_),_(*_$__)(),___*_$$,_ __$$,_*__$_,_ _$_){__ _$_<__$$?_$_$(*(__$_+=_$$[_$_]=='>'?('"'-'!'):_$$[_$_]=='<'?-('"'-'!'):('}'-'}'))+=_$$[_$_]=='+'?('*'-')'):_$$[_$_]=='-'?-('.'-'-'):_$$[_$_]=='.'?(_$$$(*__$_),('}'-'}')):_$$[_$_]==','?(_$__()-*__$_):('_'-'_'),_$$$,_$__,_$$,__$$,__$_,(_$$[_$_]=='['?(!*__$_?_$$_(_$$,_$_,('#'-'"'),(';'-':')):_$_):_$$[_$_]==']'?(_$$_(_$$,_$_,-(')'-'('),(';'-':'))-('%'-'$')):_$_)+('-'-',')):(']'-']');} _ ___$__$$___$(_(*_$$$)(_),_(*_$__)(),___*_$$,_ __$$){_ __$[(('}'-'!')*((('}'-'!')*(('+'-'$')))+('}'-':')+('+'-'*')))+('{'-'[')]={(']'-']')};__ _$_$(('"'-'"'),_$$$,_$__,_$$,__$$,__$,('/'-'/'));}

Let’s grab a “Hello World” program in brainfuck and run my majestic non-alphanumeric interpreter on it:

% cat hello.c
#include <stdio.h>
#include "output.c"

int main(void) {
    char * code = "++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.";
    ___$__$$___$(putchar, getchar, code, strlen(code));
}
% gcc hello.c -o hello && ./hello
Hello World!

The compiler - constant generation

In October of 2021, the circumstances made me remember the idea I had with obtaining numbers via subtracting ASCII character codes. I had some time on my hands (around 30 minutes) since I wasn’t home at the time, but I had access to a computer, so I thought that I can make a program that turns any number into a non-alphanumeric C expression to kill some time.

First, I asked my APL session to give me all the non-alphanumeric characters (everything in range 33 to 126 that isn’t a lowercase letter, uppercase letter or a digit) that aren’t problematic (aren’t a quote or a slash):

ns.cs←(⎕UCS∘(32∘+)⍳93)~∊⎕A ⎕D(⎕C ⎕A)'''' '\'

Now, since the numbers in their basic form are represented as character minus character ('X'-'Y'), I generated a table of all possible combinations of it using the outer product:

∘.,⍨ns.cs

Then, I tied the resulting character pairs with their ASCII code difference and turned it into a simple list:

,{⍵,⍥⊂¨-/¨⎕UCS¨⍵}∘.,⍨cs

Next up, I removed all the pairs that sum up to a negative number or zero and sorted the results descendingly, starting with the pairs that have the largest difference:

ns.keys←{⌽⍵[⍋2∘⊃¨⍵]}{⍵/⍨{0<2⊃⍵}¨⍵},{⍵,⍥⊂¨-/¨⎕UCS¨⍵}∘.,⍨cs

Finally, for convenience I made a separate vector out of the 2nd values in the vector:

2∘⊃¨ns.keys

None of this data ever changes, so I wrapped it into a setup function that precomputes it:

setup←{
    ⍝ the cvt namespace with precomputed data for
    ⍝ the initial character set.
    ns←⎕NS ⍬
    ⍝ non-alphanumeric character set.
    ns.cs←(⎕UCS∘(32∘+)⍳93)~∊⎕A ⎕D(⎕C ⎕A)'''' '\'
    ⍝ --> quote a string.
    ns.quot←{'''',⍵,''''}
    ⍝ sort a vector of tuples based on the second element.
    sortk←{⌽⍵[⍋2∘⊃¨⍵]}
    ⍝ --> vector of tuples of character pairs that yield
    ⍝     a given result when subtracted as ASCII codepoints.
    ns.keys←sortk{⍵/⍨{0≤2⊃⍵}¨⍵},{⍵,⍥⊂¨-/¨⎕UCS¨⍵}∘.,⍨ns.cs
    ⍝ every second element extracted from the tuples.
    ns.ord←2∘⊃¨ns.keys
    ns
}

Then, I used the data I just obtained to create a function, which generates pairs of ASCII character differences that sum up to a given number:

      convert 162
┌──┬──┬──┐
│}!│}:│^[│
└──┴──┴──┘
      ⍝ A-ha! ('}'-'!')+('}'-':')+('^'-'[') must be equal to 162.

The basic algorithm is fairly simple:

  • Take the the largest element from the keys vector which is lesser or equal to the number curretly processed.
  • Push the element’s character tuple to the result list.
  • Subtract the element’s ASCII codepoint difference from the input.
  • Repeat the process until the number is zero.

An implementation of this algorithm follows:

convert←{
    prop←⍺
    ⍝ Encode a number.
    eN←{
        ⍝ initialise the result vector; if the number to convert
        ⍝ is zero, return it.
        ⍺←⍬ ⋄ ⍵=0:⍺
        ⍝ pick a list of tuples from keys which second element is the
        ⍝ maximum value smaller or equal to the number to convert.
        ⍝ pick a random tuple from this list.
        x y←(?∘≢⊃⊢) prop.keys/⍨prop.ord=2⊃⊃prop.keys/⍨⍵≥prop.ord
        ⍝ append the results, subtract from the input
        ⍝ number and recurse.
        (⍺,⊂x)∇ ⍵-y
    }
    ⍵=0:⊂'$$'⋄eN ⍵
}

(?∘≢⊃⊢) picks a random element from a vector, because it’s equivalent to {(?≢⍵)⊃⍵}, which takes a random number in range 1 to the length of vector, and picks the element with the index given by this random number.

The next step is converting the pairs intto actual C code, and noticing a certain issue:

      convert 1252
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│[#│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘

1252 isn’t that big of a number, yet it yields so many operations. Thankfully, there exists a simple fix for this issue - k=1nx=xn\sum_{k=1}^{n} x = xn. It should be said that there exists a threshold for which this transformation has an adventage over the naive method. The algorithm I implemented performs it when n5n \geq 5.

      x←eN 1513
      x
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│}!│`:│,)│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘

After obtaining the naive transformation result, I grouped the additions using my favourite APL operator:

      ↓,∘≢⌸x
┌───────┬──────┬──────┐
│┌──┬──┐│┌──┬─┐│┌──┬─┐│
││}!│16│││`:│1│││,)│1││
│└──┴──┘│└──┴─┘│└──┴─┘│
└───────┴──────┴──────┘

The full code follows:

    cp←(⊣,'+',⊢)/
    cvts←{'(',(prop.quot ⍺),'-',(prop.quot ⍵),')'}/
    cvte←{
        emit←{
            5≤2⊃⍵:'(',(cvts⊃⍵),'*(',(cvte 2⊃⍵),'))'
            x y←⍵ ⋄ cp↑y⍴⊂cvts x
        }
        ⍵≠0:∊cp emit¨↓,∘≢⌸eN ⍵
        ⊃cvts 2⍴rE prop.cs
    }

After wrapping everything in a function, I managed to confirm that the code emits reasonable output:

      cvtp cvt 12523
(('}'-'!')*(('}'-'!')+('['-'/')))+('/'-'$')
% cat code.c
#include <stdio.h>

int main(void) {
    printf("%d", (('}'-'!')*(('}'-'!')+('['-'/')))+('/'-'$'));
}
% ./code
12523

Around then, my time ran out, but when I was home, approached the entire problem from a different standpoint. To create a Turing-complete language, I to storing data one way or another. In C, which is the language I target, the most reasonable choice is using variables. Either way, variables and functions require an accompanying identifier, and underscore-only names would be incredibly long. Thankfully, GNU C allows putting the dollar sign in identifiers, so I can employ some sort of binary counting in the identifier names:

6.42 Dollar Signs in Identifier Names

In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.

GCC documentation

This doesn’t apply to preprocessor macros though, so all-underscore names will be reserved by the compiler for use as macros.

If all the defined names are stored as a vector in a variable x, assuming $ as 0 and _ as 1, one can generate identifiers by treating their indices as binary numbers:

      {'$_'[1+2⊥⍣¯1⊢⍵]}¨⍳≢x
┌─┬──┬──┬───┬───┬───┐
│_│_$│__│_$$│_$_│__$│
└─┴──┴──┴───┴───┴───┘

…but the underscores are still here, and that’s not what I wanted. Let’s remove these then:

      {⍵/⍨1≠≢∘∪¨⍵}{'$_'[1+2⊥⍣¯1⊢⍵]}¨⍳≢x
┌──┬───┬───┬───┐
│_$│_$$│_$_│__$│
└──┴───┴───┴───┘

The problem with this solution alone is fairly clear - I wanted 6 identifier names, but after purging the invalid ones I’m left with only four, so the goal is to predict (somehow) how many numbers in form 2^n - 1 are there in range from <1, x>. The number I’m looking for is exactly the amount of digits in x’s binary representation, so I added 1+ceil(log_2(x)) to the input to make sure that the code always generates just enough identifier names. Sometimes it will generate too many, so the final reshape takes care of that:

      {⍵⍴{⍵/⍨1≠≢∘∪¨⍵}{'$_'[1+2⊥⍣¯1⊢⍵]}¨⍳{⍵+1+⌈2⍟⍵}⍵}≢x
┌──┬───┬───┬───┬────┬────┐
│_$│_$$│_$_│__$│_$$$│_$$_│
└──┴───┴───┴───┴────┴────┘

The algorithm for generating the macro names is obvious:

      '_'⍴¨⍨⍳5
┌─┬──┬───┬────┬─────┐
│_│__│___│____│_____│
└─┴──┴───┴────┴─────┘

Having solved two main problems I faced, I glued both of these solutions together to create a working “compiler”:

transcode←{
    cf←cvtp∘cvt
    dec←{
        0=≢⍵:0 ⋄ 11::'Hexadecimal number too big'⎕SIGNAL 11
        16⊥16|1-⍨'0123456789abcdef0123456789ABCDEF'⍳¯1↓⍵
    }
    ra←{code∘←((↑⍵[2])⎕R(↑⍵[1]))code}
    code←('\$[^\$]+\$'⎕R{⍕#⍎1↓¯1↓⍵.Match})⍵
    code←('\b[0-9][0-9A-Fa-f]*h\b'⎕R{cf dec ⍵.Match})code
    code←('\b[0-9]+\b'⎕R{cf⍎⍵.Match})code
    pairs←⍺{⍵ ⍺}¨'_'⍴¨⍨⍳≢⍺
    defc←{∊'#define '(⍵[1])' '(⍵[2])}¨pairs
    _←ra¨pairs
    fr←∪('[A-Z]+'⎕S{⍵.Match})code
    ids←{⍵/⍨1≠≢∘∪¨⍵}{'$_'[1+2⊥⍣¯1⊢⍵]}¨⍳(⊢+∘⌈2∘⍟)≢fr
    _←ra¨ids{⍺ ⍵}¨fr
    defc,⊂code
}

It supports hexadecimal numbers, decimal numbers, providing custom names to define, automatically replaces uppercase names with their non-alpanumerically encoded counterparts and does the same to numeric constants. I can also insert APL code between dollar signs to have it evaluated and put in the code.

But that’s not everything just yet - I’d like to have a C preprocessor here, something that automatically filters out defines, and runs the APL program I just made. A small shell script tied together with a processing Perl script will do:

my @a = ();
# Slurp everything from STDIN.
$_ = do { local $/; <STDIN> };
# Remove garbage left after gcc.
s/^#.*$//gm;
# Process "%define" statements.
s/^%define[ \t]+([A-Za-z_][A-Za-z0-9_]*)$/push(@a, $1);"";/gem;
# Strip newlines (they must be replaced with a space, because
# a space is functionally equal to a newline by the C standard).
s/\n/ /g;
# Input for the APL program.
# First line - defined stuff
print (join " ",@a);print "\n";
# Second line - the code with leading/trailing whitespace removed.
s/^\s+|\s+$//g;print;print "\n";
#!/bin/bash
# Preprocess the input, pass it through initial directive processing, turn it
# into a garbled mess using the APL program.
gcc -E -I. - | \
    perl hlc.pl | \
    DYALOG_LINEEDITOR_MODE=1 /opt/mdyalog/18.0/64/unicode/mapl -script llc.apl

Last but not least, I wanted to use the C preprocessor for a purpose. Transforming loops manually isn’t too fun, so a few pre-defined macros in a library file the programmer can include won’t hurt:

// FOR1(f, f2, int, int i, 0, i < 10, i + 1, int x, x + 1, x)

#define FOR5(name, name2, r, i, v, cond, postop, a1, a2, a3, a4, a5, an1, an2, an3, an4, an5, d) \
    r name2(a1,a2,a3,a4,a5,i){return(cond)?name2(an1,an2,an3,an4,an5,postop):d;} \
    r name(a1,a2,a3,a4,a5){return name2(an1,an2,an3,an4,an5,v);}

#define FOR4(name, name2, r, i, v, cond, postop, a1, a2, a3, a4, an1, an2, an3, an4, d) \
    r name2(a1,a2,a3,a4,i){return(cond)?name2(an1,an2,an3,an4,postop):d;} \
    r name(a1,a2,a3,a4){return name2(an1,an2,an3,an4,v);}

/* ... */

// WHILE1(f, int, x > 5, int x, x - 1, x)

#define WHILE5(name, r, cond, a1, a2, a3, a4, a5, an1, an2, an3, an4, an5, d) \
    r name(a1,a2,a3,a4,a5){return(cond)?name(an1,an2,an3,an4,an5):d;}

#define WHILE4(name, r, cond, a1, a2, a3, a4, an1, an2, an3, an4, d) \
    r name(a1,a2,a3,a4){return(cond)?name(an1,an2,an3,an4):d;}

/* ... */

// DO_WHILE1(f, f2, int, x >= 0, int x, x - 1, x)

#define DO_WHILE5(name, name2, r, cond, a1, a2, a3, a4, a5, an1, an2, an3, an4, an5, d) \
    r name2(a1,a2,a3,a4,a5){return(cond)?name2(an1,an2,an3,an4,an5):d;} \
    r name(a1,a2,a3,a4,a5){return name2(an1,an2,an3,an4,an5);}

#define DO_WHILE4(name, name2, r, cond, a1, a2, a3, a4, an1, an2, an3, an4, d) \
    r name2(a1,a2,a3,a4){return(cond)?name2(an1,an2,an3,an4):d;} \
    r name(a1,a2,a3,a4){return name2(an1,an2,an3,an4);}

/* ... */

Now, I can simplify (for some definition of this word) my brainfuck interpreter using these macros:

#include "sym_c.h"
%define return
%define int
%define char
DO_WHILE4(B,A,int,G,char*D,int E,int F,int G,D,E+F,F,G+(F*(D[E+F]=='['?1:D[E+F]==']'?-1:0)),E);
WHILE7(C,int,E<I,int F,int(*G)(int),int(*H)(),char*D,int I,int*J,int E,*(J+=D[E]=='>'?1:D[E]=='<'?-1:0)+=D[E]=='+'?1:D[E]=='-'?-1:D[E]=='.'?(G(*J),0):D[E]==','?(H()-*J):0,G,H,D,I,J,(D[E]=='['?(!*J?B(D,E,1,1):E):D[E]==']'?(B(D,E,-1,1)-1):E)+1,0)
int RUN_BF(int(*G)(int),int(*H)(),char*D,int I){int F[65536]={0};return C(0,G,H,D,I,F,0);}

Finally, I decided to put all of the code on Github - feel free to play with the compiler yourself.