Writing a C Compiler (Part 1): Theory and Setup

2024-09-24

Inspiration

I want to preface this effort by acknowledging the Tiny C Compiler project by Fabrice Bellard (which doubtlessly inspired this project).

The world is replete with C-family lexers, parsers, tokenizers, compilers, interpreters, and transpilers; it's the lingua franca of the digital world in more ways than one, and permeates every corner of the globe. Most high-level languages need to "speak" C to exploit native libraries, access system resources, for FFI compatibility, etc.

So, why retread such well-trodden ground?

Why a C Compiler?

Why not? It seems like a fun, enriching exercise.

A more technical explanation is mostly because C is syntactically an uncomplicated language that I'm already familiar and comfortable with. With a short list of keywords, a simple grammar that can comfortably tuck into the annex of the standard document (Annex A), and practically no cooked-in corner cases, it's practically begging to be an aspiring compiler dev's first language. Plus, a C compiler being nothing new means there's a lot of prior art to borrow from if need be, as well as several impressive batteries of tests to validate well-known language behaviors, meaning there's very little guesswork to be done about whether some feature was implemented correctly or not.

I've heard it said that reading the standard (or design document, or manual, or whatever tome describes the implementation details) for a language is the only real way to fully 'get' it. After all, what better way to know how to fully exploit your language of choice than to know exactly what building blocks are available to you to rearrange? Just slogging through hundreds of pages of technical information makes for a tedious task without some way to apply it -- not unlike joylessly memorizing the formulas and theorems needed to pass highschool calculus, taught as pure mathematics divorced from a suitable, real-world context like physics. I enjoy application holding hands with theory, so I want to take the opportunity to make something with it.

Why WebAssembly?

WebAssembly (or WASM), to me, represents a fascinating evolution in the universal-web-platform ecosystem; instead of relying exclusively on JavaScript to describe logical processes, they can instead be described in a much lower-level, assembly-like syntax for a stack-based virtual machine, meaning control flow much closer to a traditional computer program. Because web standards are implemented in a (fairly) homogenous ecosystem, this means that assembly-like code can be executed almost anywhere -- which isn't a new concept by itself, considering that portable bytecodes like those of Lua or Python, have existed for decades, but is nonetheless exciting because of the ubiquity of web browsers, meaning no extra software or expertise is needed to take advantage of WebAssembly.

Besides being an exciting new way to make and distribute software, WASM's syntax is also fairly straightforward, and is described in wonderful detail in Mozilla's developer documents.

High-Level Goals

This project has the following parameters and goals:

The compiler's input will be one or more C99-compliant C source code (*.c) files.
The compiler's output will be one or more WebAssembly text format (*.wat) files, and the necessary JavaScript (*.js) boilerplate needed to load and execute the generated WASM.
- As a future goal, adding the option to output directly to WebAssembly binary format (*.wasm) files would be a logical step.
The compiler will use the tests from c-testsuite to test for adherence, and include the necessary configurations to run them.

The Compilation Process

The thousand-foot abstract of a compiler is that it takes some high-level structured information (source code) and transforms it into some other representation, traditionally into low-level machine code that a computer can make direct use of. In my case, WebAssembly is the target.

So, what then are the steps needed to transform source code into its output format?

How does one get from the left to the right?

The entire process can be (VERY) briefly summarized as:

Tokenize the input; that is, split the input source into a series of tokens, chopping up the stream of characters into meaningful parts like keywords (struct, int, etc.), symbols (%, #, !, etc.), strings ("test"), and so on.
Parse the token sequence, matching them against the logical structures defined by the language. For instance, a naive assignment expression might look like this:
```
	foo		=	22
	IDENTIFIER	SYMBOL	LITERAL
```
There will also be expressions for control flow, type definitions, function declarations, etc. A sequence of these expressions forms the logical basis for a program, and provide the data upon which to operate.
Generate the sequence of instructions in the destination format that reflects the instructions of the input language.

In the real world, there's a tremendous amount of work that happens between step 2 and 3 related to intermediate representations (a sort of compiler-specific bytecode meant to represent input source code in a uniform way) and the varying rounds of universal, hardware-specific and platform-specific optimizations done before the final, resulting output is generated.

For a proof-of-concept, aiming for aggressive optimizations done on an IR would be overkill; just getting from "C source input" to "WASM text output" will suffice.

Roadmap

The immediate goal of this project will be to support defining a very simple add() function, whose behavior will then be translated into the WebAssembly text format, and accompanied by a generated JavaScript file telling the browser how to interact with it:

int add(int a, int b)
{
	return a + b;
}

To accomplish this, the following are required:

A complete tokenizer - The C99 standard only specifies 5 distinct kinds of tokens, might as well get them all out of the way at the same time.
A minimal parser - I just want to adhere to the standard enough to define a function with some primitive, non-pointer, non-reference parameters.
A minimal generator - As above, implement just enough of the language to do some basic arithmetic and return a value.

Next Time

In my next post, I'll be discussing the C99 language standard (draft), as well as how I've chosen to implement the tokenizer.