A conceptual foundation for programming languages and self-compilation

January 31, 2021

A while ago I became interested in programming languages - how they work and how they are created and developed. A compiler or interpreter is in some ways just an ordinary program, but their significance and the fact they take other programs as input can make them pretty hard to understand. Throw in self-referential concepts like bootstrapping, self-hosting, and quines, and you’ve got good ingredients for a headache.

So, I tried to put together a simple set of concepts so I could explain to myself: what is a programming language, really? And what is a program? In this article, I hope to give an easy-to-understand perspective on these questions!

We’ll start with our most basic definition.

Definition. A program is some string of symbols.

The actual form of a program and details of these symbols is irrelevant to us. The definition serves only to highlight that a program is an inert object; on its own, it means nothing and does nothing. We need something else to give it meaning. Enter the language:

Definition. A programming language is a function which maps a set of programs to a set of functions.

Let’s illustrate this. Suppose \(L\) is a language and \(p\) is a program which is valid in \(L\). \(L\) maps programs to functions, so \(L(p)\) is a function. It represents the abstract notion of an action, or what \(p\) “does.” \(L(p)\) maps some input \(x\) to some output \(y\) according to the interpretation of \(p\). In other words,

\[L(p)(x) = y.\]

In some ways we might think of this as actually being a computer, i.e. something that assigns meaning or action to programs. However this is an effective way to think about programming languages themselves.

Another way to write this is

\[\begin{align*} L(p) = f \\ f : X \to Y \\ f(x) = y \end{align*}\]

where we’ve given \(L(p)\) another name and explained that it maps from some set of inputs \(X\) to a set of outputs \(Y\).

Just with this, we can already arrive at some interesting definitions you may have heard of. (Don’t worry if you haven’t - these are mainly to give some examples to the above definitions.)

Definition. A program \(q \in L\) is a quine if \(L(q) = x \mapsto q\).

\(x \mapsto q\) is like mathematician’s lambda notation; on the left is a variable, and on the right is an expression in terms of that variable.

So this definition means that \(L(q)\) is a function which, taking any input, gives the program \(q\). Informally: \(q\) is a program which prints itself! (Quines are pretty fascinating in themselves: have a read of this great overview by David Madore, or the huge list of examples on Rosetta Code.)

Note that I’ve written \(q \in L\), which doesn’t really make sense, as \(L\) is a function, not a set. I’ll use this to mean that \(q\) is in the domain of \(L\), i.e. \(q\) is a valid \(L\)-program.

Here’s another fun definition:

Definition. A program \(p\) is a polyglot in languages \(L, M\) if \(L(p) = M(p)\).

I.e. \(p\) has the same meaning in the language \(L\) as it does in language \(M\).

Polyglots are also fascinating in practice; here are some examples. There are also polyglot quines!

And one more example. A code generator takes input data and emits a program.

Definition. \(p \in L\) is a code generator for \(M\) if

\[\begin{align*} L(p) = f \\ f : X \to M. \end{align*}\]

This means \(L(p)\) is a function which maps elements from a set \(X\) (some set of all data) to programs in the language \(M\) (again, treating the function \(M\) as a set). We might also write this as:

\[L(p) : X \to M.\]

Implementing a new language

Can we implement a new language in this system? We need to create a program in some existing language \(L\) which takes as input programs in our new language, call it \(M\).

Let \(p \in M\) be a program. There are two natural choices we have.

Firstly, we could write a program \(n \in L\) such that

\[\begin{align*} L(n) = f \\ f : M \times X \to Y \\ f(p, x) = M(p)(x). \end{align*}\]

In English, this means \(L(n)\) is a function which maps a \(M\)-program together with an input \(x\) to the output of that \(M\)-program applied to \(x\). Does this seem familiar at all…?

It’s actually an interpreter for \(M\)!

Considering a program as a function, an interpreter is sort of like “function application” - \(\text{apply}(p, x) = p(x)\). It takes the program and the program’s input, and computes the program applied to the input.

Our second option is to write a program \(c \in L\) such that

\[\begin{align*} L(c) = f \\ f : M \to L \\ L(f(p)) = M(p). \end{align*}\]

What are we saying here? First, \(L(c)\) is a function between programs. It maps programs in \(M\) to programs in \(L\). The last line is the condition that an \(M\)-program maps to an \(L\)-program which has the same meaning. A program which translates programs, preserving their meaning… Well, this is just a compiler!

A compiler is different to an interpreter in that the program and the program’s input come in at different times. The interpreter takes them at the same time, but the compiler takes only the program, and produces another program which can be given the input at a later time.

A shorter way of writing that \(L(f(p)) = M(p)\) for all \(p\) is to use function composition notation:

\[L \circ f = M.\]

This concisely expresses the idea that the meaning of a program in \(M\) is the same as its meaning in \(L\) after \(f\).

The computer

So far we’ve talked about programs, which are inert objects, and languages, which assign programs meaning. However, we haven’t talked about computers, the things which actually enact the meaning of programs.

Fundamentally, there’s no theoretical difference between you or I stepping through an algorithm on paper, and a computer running a program. However, it is something that we distinguish in practice, because computers are just so much faster than us at this! So we need some idea of a “computer language” or what it means to “run a program” on a computer.

Well, a computer language is nothing special. Computer processors are electronic circuits which read machine code instructions from a memory device. But if we want to write a program in a more human-friendly language, but have the processor do the computation, we’re going to need some set of programs which allow us to compute the language in terms of the machine code.

Definition. Let \(L, M\) be languages. A set of programs \(S\) reduces \(M\) to \(L\) if, for any program \(p \in M\), \(M(p)\) can be composed from functions derived starting from only \(S\) and knowledge of \(L\).

Informally we can think of this definition as being: \(S\) reduces \(M\) to \(L\) if we can use \(S\) to run any \(M\)-program on an \(L\)-computer.

Compilers and interpreters come into this just as you might expect:

Proposition. A set of programs \(S\) reduces \(M\) to \(L\) if at least one is true:

  1. A compiler \(L(c) : M \to L\) is in \(S\);
  2. An interpreter \(L(n) : M \times X \to Y\) is in \(S\);
  3. There exists a language \(Q\) such that \(S\) reduces \(M\) to \(Q\) and \(Q\) to \(L\).

I won’t prove this here, as its not particularly interesting. The most notable is case (3), which express the fact that language reduction is transitive.

(Mathematical aside: Our definition requires us to construct the meaning of a program by composing functions “derived starting from only \(S\) and knowledge of \(L\).” But since the input of an interpreter is \(M \times X\), to prove (2) we need to invent a function \(x \mapsto (p, x)\) to compose with it. This sort of breaks our definition of reduction. One solution is to “curry” the interpreter so it maps to a function; so instead of \(L(n)\) being \(M \times X \to Y\), we have \(M \to (X \to Y)\). Then we can say \(M(p) = L(n)(p)\).)

Here’s a very simple example. {/usr/bin/gcc} reduces C to x86 machine code. Why? By use of case (1): a compiler exists in the set. (I wrote /usr/bin/gcc to highlight the fact this is the compiled x86 binary, not the GCC source code.)

Now a more complex example. Suppose we’re running JSLinux, a JavaScript PC emulator running Linux, in RISC-V mode, in our web browser on an x86 computer. Then we have \(S\) = {JSLinux emulating RISC-V, /usr/bin/gcc in JSLinux, web browser} reduces C to x86.

To show this, we invoke (3) twice with JavaScript and RISC-V.

  • \(S\) reduces C to RISC-V machine code by (1) with the compiled GCC in JSLinux.
  • \(S\) also reduces RISC-V to JS by (2) with JSLinux (as it emulates a RISC-V processor in JavaScript).
  • And finally, \(S\) reduces JS to x86 by (2) with the web browser, which interprets JS.


Some compilers compile for the languages they’re written in. They’re called “self-hosting:”

Definition. Let \(L\), \(M\) be languages. A program \(c \in M\) is a self-hosting compiler for \(M\) if \(M(c) : M \to L\) is a compiler.

Now, this really is a bit of a brain-bender - how can we turn this compiler into something we can run, so we can use the language? A compiler written in the language it compiles… in order to compile it, we need to compile it! But how can we compile the compiler??

Remember that “running a program” is only possible if we can reduce it to some target language. So, we need to reduce \(M\) to \(L\). The problem is \(\{c\}\) alone doesn’t reduce any languages. \(\{c, M(c)(c)\}\) does reduce \(M\) to \(L\), but this is still circular, as we can’t work out \(M(c)\) with just \(L\).

How do we get around this?

In theory it’s very simple–we just need another set of programs \(B\) which makes \(B \cup \{c\}\) reduce \(M\) to \(L\). By our proposition above, \(B\) can just be one already-existing \(M\)-compiler or \(M\)-interpreter in \(L\). However, this still isn’t very satisfactory… What if we’re creating a new language, and no alternative implementations exist?


Bootstrapping is the process of creating \(M(c)(c)\), the self-compiled compiler, without using \(M\) directly. It turns out that we can use a different, less capable language to bootstrap.

Proposition. Suppose

  • \(L, M\) are languages
  • \(S\) is a set of programs
  • \(c \in S\) and \(M(c) : M \to L\) is a self-hosting compiler.

Then \(S\) reduces \(M\) to \(L\) if there exists a program \(b \in S\) and some language \(N\) such that

  1. \(S\) reduces \(N\) to \(L\)
  2. \(L[N(b)(c)](c) = M(c)(c)\).

Proof. By assumption (1), \(S\) reduces \(N\) to \(L\), so we can obtain \(a = N(b)(c)\), which we know by (2) is a valid program in \(L\). \(L(a)(c) = M(c)(c)\) by assumption (2). \({M(c)(c)}\) reduces \(M\) to \(L\), because it is an \(L\)-program which compiles \(M\) to \(L\). Therefore, \(S\) reduces \(M\) to \(L\). □

So what is this really saying? What does \(L[N(b)(c)](c) = M(c)(c)\) really mean? And what’s the point?

As I wrote before, it basically allows us to use a different, less capable language to bootstrap our compiler.

The strongest similarity between \(N, b\) and \(M, c\) we could state would be \(N(b) = M(c)\), i.e. \(b\) is a complete re-implementation of \(c\)’s behaviour in \(N\) - for all \(M\)-programs.

The next strongest would be \(N(b)(c) = M(c)(c)\) i.e. \(b\) is an \(M\)-compiler, but only for the input program \(c\).

\(L[N(b)(c)](c) = M(c)(c)\), the statement in the above proposition, is even weaker! It says that \(b\) is like an \(M\)-compiler, but only enough such that when compiling \(c\), it produces an \(L\)-program which actually compiles \(c\).

The key part of this is that \(c\) gets “compiled” twice. The first time, we create an \(L\)-program which compiles \(M\)-programs, just like \(M(c)(c)\) - but there’s no requirement it be the same as that program. There’s also no requirement it be valid for any program other than \(c\). The second time, we use this compiler we just created on \(c\) again. Because it is an \(M\)-compiler (for \(c\)), when we compile \(c\) we get the full \(M(c)(c)\) we wanted.

This is useful in practice for several reasons:

  • \(b\) only needs to understand the subset of \(M\) which is used by \(c\). On the off-chance that \(c\) doesn’t fully use \(M\)’s syntax, we can omit implementing some of it. However, this isn’t the only benefit. Lots of languages have features which help to catch bugs at compile-time or provide extra assurances that programs are correct. We can completely ignore these features for the bootstrap. (Technically this might mean \(b\)’s input isn’t strictly a subset, just that \(c\) is in its intersection with \(M\).)

  • \(b\) does not need to produce \(L\) code to the same standard as \(c\). Lots of work that goes into compilers isn’t developing the input language; it’s making the output programs more efficient. We don’t need to replicate any of that effort in our bootstrap, because our inefficient compiler can just compile the efficient one.

A real-world example of this bootstrapping is mrustc. The language Rust has no widely-recognised bootstrapping method - each new version of the compiler is compiled with the previous version. mrustc attempts to compile a recent version of Rust “from scratch” with a different language. And as we’ve worked out, in doing that it can be less optimising and also can forgo many of Rust’s correctness-checking language features. The authors say it themselves in the readme:

mrustc works by compiling assumed-valid Rust code (i.e. without borrow checking) … This works because the borrow checker doesn’t have any impact on the generated code, just in checking that the code would be valid.


This concludes about everything I wanted to cover. I hope that this notation/approach to programming languages is clear and has helped you understand these quite strange concepts!