19 June 2020

I have a newer article which explores this in more detail.

I’m currently working on implementing a programming language for a retro 8-bit computer. It has been a lot of fun, but there have been many, many intellectual challenges to overcome, with new and weird concepts to understand.

Two such concepts are the bootstrapping and self-hosting of programming lanuage compilers. I will attempt to explain these in this article (as much to myself as to you).

Self-hosting compilers

First, lets clarify some terms. We’ll say a compiler is a computer program which transforms code written in one language (the input) to code written in another (the target). As the compiler is itself a program, it will be implemented in some third language (the source). For example, we could imagine a Ruby program which compiles C to ARM assembly.

Usually, though, people don’t write C compilers in Ruby. Most C compilers are written in… C.

This kind of construction, when the source and input languages are the same, is called a self-hosting compiler. The idea is that the compiler can compile its own source code. It might be paradoxical to think about how this could have happened, so let’s explore the principle.

Bootstrapping

Creating a self-hosting compiler is called bootstrapping. There are two general ways it can be done:

The first way is to write or obtain an L interpreter or L-to-T compiler written in some other language M. Then we rewrite the compiler in L. Finally, we compile the L-to-T compiler written in L with the L-to-T compiler written in M. Novel languages have no option but to take this approach, due to the lack of any tools which can process them. For example, the Rust compiler was initially written in OCaml, until it became mature enough that it could be re-written in Rust itself.

The second way is to take an already existing self-hosted L-to-S compiler and modify it to produce T code. Then I can compile my new L-to-T compiler source code with the L-to-S compiler. At this point we have an S program, from L source, which compiles L to T. We can stop here, unless we want our L-to-T compiler as an executable T program, instead of an S program. In that case we’ll have to recompile our new L-to-T compiler with our S program to produce an L-to-T compiler T program. (Confused yet?)

Extending a language

Now that we have a self-hosting L compiler, suppose we wish to make some extension to L, creating a larger language L2. We can alter a self-hosting L-compiler to understand this, which will give us an L2-compiler written in L. As L is a subset of L2, this is still self-hosting! After recompiling itself, we can then use L2’s new features in the source code of our compiler, giving us a proper L2-compiler written in L2. This iterative process is how self-hosting compilers are developed day-to-day and improved upon.

An interesting case to think about is when only the language T is currently implemented. In an extreme case, we might consider T to be the machine code for a very early CPU. To implement a compiler, we would have to pick our first bootstrapping method with M = T. In other words: we would hand-write in T a program which could compile a more convenient and human-readable representation of T into actual T code–or in other words, an assembler! Then we could use this assembler to create a more complex assembler, or perhaps a compiler in a slightly higher-level language, and iterate on this as much as we like. This is how the first programming languages were developed.

GCC

For a practical example of bootstrapping, consider the C compiler GCC. GCC is not bootstrapped from scratch; in fact it requires an already-existing C compiler to build. It undergoes a three-stage bootstrapping process:

  • Stage 1: A minimal C compiler is built with the host (existing) compiler
  • Stage 2: GCC is built with the host-compiled minimal compiler
  • Stage 3: GCC is built again with the host-compiled GCC

Why the three stages? The first ensures that we have a compiler which supports the features we need for GCC. The second gives us our first GCC. The third stage, though technically unnecessary from a correctness point of view, ensures that we get the advantages of GCC code generation and optimisation in our compiler executable. This could be quite important; we don’t want our first minimal compiler to be any more complicated than necessary (so we don’t have to develop two compilers at the same time!), so our emitted machine code might be quite unoptimised. (This third stage also corresponds to my previous discussion about compiling an L-to-T compiler source with an L-to-T S program to produce an L-to-T T program.)

This whole topic is very confusing–I got a lot of headaches writing this article, so I hope it was helpful and/or interesting! I’m not an expert in this area, so if I’ve made any mistakes, or things can be explained simpler, then feel free to send me a Tweet.