Everything I wish I knew when learning C

18 November 2022

Learning C was quite difficult for me. The basics of the language itself weren’t so bad, but “programming in C” requires a lot of other kinds of knowledge which aren’t as easy to pick up on:

C has no environment which smooths out platform or OS differences; you need to know about your platform too
there are many C compiler options and build tools, making even running a simple program involve lots of decisions
there are important concepts related to CPUs, OSes, compiled code in general
it’s used in such varied ways that there’s far less of a centralised “community” or style than in other languages

This page is a living collection of summaries, signposts, and advice for these broader points that made my journey with C and other compiled languages easier. I hope it’s useful to you! (And if it is, make sure to subscribe for any updates.)

General resources
Good projects to learn from
Compilation, linking, headers, and symbols
Undefined behaviour (UB)
Do not use these functions
Arrays aren’t values
Essential compiler flags
Three types of memory, and when to use them
Naming conventions
static
The struct method pattern
const
Platforms and standard APIs
Integers
Macros vs const variables
Macros vs inline functions

General resources

TutorialsPoint C: very basic intro
awesome-c: big list of libraries and tools
cppreference: technical reference for the C language and standard library

Good projects to learn from

Sometimes it’s helpful to just read some small, self-contained C code to get to grips with how it looks.

Bloopsaphone, a Ruby library for synthesising sounds which has a small C module at its core. Has a small number of concepts and a good structure.
esshader, a GLSL shader viewer like ShaderToy.com. A small program which just glues a few libraries together.
Brogue CE, a roguelike video game, >30k LOC. I maintain this, and many of our contributors have sharpened their C by working on it.
Simple Dynamic Strings (sds). Has one .c and .h file each, and is a good example of how you might do more complex resource management.
stb single-file libraries. These are small to medium-sized modules designed to be highly portable, including targeting embedded devices and games consoles.

Compilation, linking, headers, and symbols

Some basics on how C compilation works, because it will help other things make sense.

C code is written in .c source files. Each source file is compiled to a .o object file, which is like a container for the compiled function code in the .c file. Object files are not executable. Each one contains a table of symbols, which are the names of the global functions and variables defined in that file.

# compile to objects
cc -c thing.c -o thing.o
cc -c stuff.c -o stuff.o

Source files are completely independent of each other, and can be compiled to objects in parallel.

To use functions and variables across files, we use header files (.h). These are just ordinary C source files used in a specific way. Recall above that object files only contain the names of global functions and variables: no types, macros, or even function parameters. To use symbols across files, we need to specify all this extra information needed to make use of them. We put these “declarations”¹ in their own .h file, so other .c files can #include them.

To avoid duplication, a .c file will typically not define its own types/macros etc. and will just include the header file for itself or the module/component it’s part of.

Think of a header file as a specification of an API, that can be implemented across any number of source files. You can even write different implementations of the same header, for different platforms or purposes.

When compiling a reference to a symbol that has only been declared (e.g. by an included header) and not defined, the object file will mark that this symbol is missing and needs to be filled in.

The final work of joining one or more objects together, matching up all symbol references, is done by the “linker” component of the compiler. The linker outputs complete executables or shared libraries.

# link objects to executable
cc thing.o stuff.o -o gizmo

In summary, we don’t “include” other source files in C, like we do in other languages. We include declarations, and then the code gets matched up by the linker.
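As a rough sketch of the declaration/definition split, here is what a header and two source files might contain, squashed into one listing. The file names follow the commands above, but the functions thing_double and stuff_quadruple are hypothetical, made up for illustration.

```c
/* --- thing.h (hypothetical): declarations only --- */
int thing_double(int x);   /* declaration: name, parameter and return types */

/* --- stuff.c (hypothetical): would #include "thing.h" ---
 * Compiled on its own, stuff.o records thing_double as a missing
 * symbol for the linker to fill in. */
int stuff_quadruple(int x) {
    return thing_double(thing_double(x));
}

/* --- thing.c (hypothetical): would #include "thing.h" ---
 * The definition; thing.o exports the symbol thing_double. */
int thing_double(int x) {
    return x * 2;
}
```

In a real project each part would live in its own file, and stuff.c would compile fine without ever seeing the body of thing_double.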

Undefined behaviour (UB)

Quite a lot of behaviour in C is specified by the standard as undefined. Any undefined behaviour makes the program, in theory, badly formed, and may lead to inconsistent behaviour or crashes. Unfortunately, it is hard to remember and surprisingly easy to encounter. In many cases, compilers will patch over UB with sensible (but compiler-specific) code, making it hard to notice it at all.

Here is a strange bug we had in Brogue, only on some platforms, due to UB: Missing item names · Issue #30 · tmewett/BrogueCE

For more details see Nayuki’s Undefined behavior in C and C++ programs.
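One commonly cited example of UB is signed integer overflow. Here is a minimal sketch of avoiding it by checking before adding; the function name checked_add is hypothetical and not taken from the linked articles.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b without ever overflowing. Returns false (and leaves *out
 * untouched) if the true sum does not fit in an int. */
bool checked_add(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;   /* a + b would overflow, which is undefined */
    }
    *out = a + b;
    return true;
}
```

The check is done with subtraction on the safe side of the limit, so the test itself never overflows.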

Do not use these functions

C is old and tries to be highly backwards-compatible. As such it has features that ought to be avoided.

atoi(), atol(), and friends; they return 0 on error, but this is also a valid return value. Prefer strtol(), etc.
gets() is unsafe as no bounds on the destination buffer can be given. Prefer fgets().
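A sketch of error-checked parsing with strtol(), which unlike atoi() can tell a real 0 apart from a failure. The wrapper name parse_int and its rules (whole string must be a number, result must fit in an int) are my own assumptions, not a standard API.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parses a base-10 int from s. Returns true on success and stores the
 * result in *out; returns false on empty input, trailing junk, or overflow. */
bool parse_int(const char *s, int *out) {
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0') return false;   /* no digits, or junk after */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX) return false;
    *out = (int)v;
    return true;
}
```

With this, parse_int("0", &v) succeeds with v set to 0, while parse_int("abc", &v) fails, a distinction atoi() cannot make.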

See also My review of the C standard library in practice, where Chris Wellons highlights many issues across the entire standard library.

Arrays aren’t values

It’s important to realise that C, as a language, deals only with known-size pieces of data. You could probably summarise C as “the language of copying known-size values.”

I can pass an integer or a struct around a program, return them from functions, etc. and treat them as proper objects because C knows their size and hence can compile code to copy their full data around.

I can’t do this with an array. The sizes of arrays are not known in any useful way to C. When I declare a variable of type int[5] in a function, effectively I don’t get a value of type int[5]; I get an int* value which has 5 ints allocated at it. Since this is just a pointer, the programmer, not the language, has to manage copying the data behind it and keeping it valid.

However, arrays inside structs are treated as values and are fully copied with the struct.

(Technically, sized array types are real types, not just pointers; e.g. sizeof will tell you the size of the whole array. But you can’t treat them as self-contained values.)
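To make the difference concrete, here is a small sketch. The wrapper type struct int5 and the function names are hypothetical.

```c
struct int5 { int v[5]; };   /* hypothetical wrapper around an array */

/* Takes the struct by value: the whole array is copied in, so the
 * caller's data is not modified. */
int first_after_bump(struct int5 s) {
    s.v[0] += 100;
    return s.v[0];
}

/* Takes an "array": the parameter is really an int*, so this
 * modifies the caller's data. */
void bump_first(int a[5]) {
    a[0] += 100;
}
```

Calling first_after_bump leaves the caller's struct unchanged, while bump_first changes the caller's array in place.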

Essential compiler flags

Compilers have so many options and the defaults aren’t very good. Here are the absolute essential flags you may need. (They are given in GCC/Clang style; syntax may vary on other compilers.)

-O2: optimise code for release builds
-g -Og: for debug builds; enable extra information for debuggers, and optimise for debugging
-Wall to enable many warnings (kind of like a linter). You can disable specific warnings with -Wno-...
-Werror to turn warnings into errors. I recommend always turning on at least -Werror=implicit, which ensures calling undeclared functions results in an error(!)
-DNAME and -DNAME=value for defining macros (useful to pass config options from the build systems to the compiler)
-fsanitize=address,undefined: for debug builds; enables two common “sanitizers,” which inject extra checks throughout the compiled code to find errors. See also all GCC instrumentation options.
-std=...: choose a standard. In most cases you can omit this to use your compiler’s default (usually the latest standard).

Three types of memory, and when to use them

Automatic storage is where local variables are stored. A new region of automatic storage is created for a function when it is called, and deleted when it returns. Only the return value is kept; it is copied into the automatic storage of the function which called it. This means that it is unsafe to return a pointer to a local variable, because the underlying data will be silently deleted. Automatic storage is often called the stack.
Allocated storage is the result of using malloc(). It survives until it is free()’d, so can be passed wherever, including upwards to calling functions. It is often called the heap.
Static storage is valid for the lifetime of the program. It is allocated when the process starts. Global variables are stored here.

If you want to “return” memory from a function, you don’t have to use malloc/allocated storage; you can pass a pointer to local data:

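#include <stdio.h>

// The caller owns the array; getData only fills it in.
void getData(int *data) {
    data[0] = 1;
    data[1] = 4;
    data[2] = 9;
}

int main(void) {
    int data[3];
    getData(data);
    printf("%d\n", data[1]); // prints 4
    return 0;
}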
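For contrast, here is a sketch that touches all three kinds of storage in one place. The names make_squares and calls_made are hypothetical.

```c
#include <stdlib.h>

static int calls_made = 0;   /* static storage: lives for the whole program */

/* Returns a malloc()'d array of the first n squares, or NULL on failure.
 * Allocated storage outlives this call; the caller must free() it. */
int *make_squares(size_t n) {
    int *out = malloc(n * sizeof *out);
    if (out == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        int sq = (int)(i * i);   /* automatic storage: gone when the block ends */
        out[i] = sq;
    }
    calls_made++;
    return out;
}
```

The trade-off against the caller-allocated version is that someone now has to remember to call free() on the result.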

Naming conventions

C has no support for namespaces. If you’re making a public library, or want a “module” to have a name, you need to choose a prefix to add to all public API names:

functions
types
enum values
macros

Additionally, you should always include some different prefix for each enum, so you know which enum type the value belongs to:

enum color {
    COLOR_RED,
    COLOR_BLUE,
    ...
};

There’s no real convention about names, e.g. snake_case vs camelCase. Pick something and be consistent! The closest thing to a convention I know of is that some people name types like my_type_t since many standard C types are like that (ptrdiff_t, int32_t, etc.).

static

On a function or file-level variable, static makes it file-local. It won’t be exported as a symbol for use by other source files.

static can also be used on a local variable, which makes the variable persist between calls to that function. You can think of this like a global variable that is scoped to only one function. This can be useful to compute and store data for reuse by subsequent calls; but remember, this comes with the usual caveats of global/shared state, such as clashing with multiple threads or with recursion.

(It can seem like it has multiple meanings, since in a global scope it seems to reduce the scope of the variable, but in a function scope it increases it. Really what it’s doing in both cases is making them file-linked.)
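A short sketch of both uses. The names wrap_id and next_id, and the limit of 1000, are hypothetical.

```c
/* File-local: not exported as a symbol, so another .c file could
 * define its own wrap_id without clashing at link time. */
static int wrap_id(int id) {
    return id > 1000 ? 1 : id;
}

/* Exported. The static local is initialised once and keeps its
 * value between calls. */
int next_id(void) {
    static int counter = 0;
    counter = wrap_id(counter + 1);
    return counter;
}
```

Successive calls to next_id() return 1, 2, 3, and so on, even though counter looks like an ordinary local variable.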

The struct method pattern

If you learned a more featureful language before C, you might find it hard to visualise how to translate that knowledge. Here’s a common idiom which resembles object-oriented programming: the “struct method.” You write functions which accept pointers to structs to alter them or get properties:

typedef struct {
    int x;
    int y;
} vec2;

void vec_add(vec2 *u, const vec2 *v) {
    u->x += v->x;
    u->y += v->y;
}

int vec_dot(const vec2 *u, const vec2 *v) {
    return u->x * v->x + u->y * v->y;
}

You can’t extend structs or do anything really OO-like, but it’s a useful pattern to think with.

const

Declaring a variable or parameter of type T as const T means, roughly, that the variable cannot be modified. This means that it can’t be assigned to, and also that it can’t be changed if T is a pointer or array type.

A pointer to T converts implicitly to a pointer to const T, but not vice versa.

It’s a good habit to declare pointer parameters to functions as const by default, and only omit it when you need to modify them.
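A sketch of that habit in practice, with hypothetical function names. Note that in const char *s it is the pointed-to data that is read-only; the pointer s itself can still be moved.

```c
#include <stddef.h>

/* s is only read, so the pointed-to data is declared const. */
size_t count_char(const char *s, char c) {
    size_t n = 0;
    for (; *s != '\0'; s++) {   /* the pointer itself may still move */
        if (*s == c) n++;
        /* *s = 'x';  would not compile: the data is read-only through s */
    }
    return n;
}

/* out is written through, so it is left non-const. */
void set_zero(int *out) {
    *out = 0;
}
```

A caller reading these two signatures can tell which function might change their data without looking at the bodies.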

Platforms and standard APIs

When you pull in #include <some_header.h> it’s hard to conceptualise what you’re depending on. It will be from one of the following:

The standard C library (abbr. “stdlib”). Examples: stdio.h, stdlib.h, errno.h
- This is part of the language specification, and should be implemented by all compliant platforms and compilers. Very safe to depend on.
- https://en.cppreference.com/w/c/header
POSIX, a standard for operating system APIs. Examples: unistd.h, sys/time.h
- Generally implemented by Linux, macOS, BSDs.
- Not available by default on Windows. Some misc. POSIX APIs are available if you use MinGW. For more complete support, there is the Cygwin library.
- You can view all details of POSIX headers (incl. C stdlib) at the official OpenGroup standard page (click “Headers” in the sidebar), or in section 3 man pages.
A non-standard operating system interface:
- Linux-specific APIs - documented mainly in section 2 man pages (system calls)
- Windows Win32 (FYI, a more modern C++ interface called C++/WinRT is also available.)
- (Mac’s OS APIs are historically used via Objective C (now Swift), not C.)

A third-party library, installed in a standard location.

It can be a good idea to interface with your more platform-specific code through a platform-neutral header file so it can be implemented in different ways. Lots of popular C libraries are basically just unified, well-designed abstractions over platform-specific functionality.
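A minimal sketch of that idea, squashed into one listing. The header name platform.h and the function platform_pid are hypothetical; getpid() (POSIX, from unistd.h) and GetCurrentProcessId() (Win32) are the real platform calls being wrapped.

```c
/* --- platform.h (hypothetical): the platform-neutral interface --- */
unsigned long platform_pid(void);   /* ID of the current process */

/* --- platform.c (hypothetical): one implementation per platform --- */
#ifdef _WIN32
#include <windows.h>

unsigned long platform_pid(void) {
    return (unsigned long)GetCurrentProcessId();   /* Win32 */
}
#else
#include <unistd.h>

unsigned long platform_pid(void) {
    return (unsigned long)getpid();   /* POSIX */
}
#endif
```

The rest of the program only includes platform.h, so it never needs to know which implementation was compiled in.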

Integers

Integers are very cursed in C. Writing correct code takes some care:

Sizes

All integer types have a defined minimum size. On common platforms, some are larger than their minimum size, such as int, which is 32-bit on Windows, macOS, and Linux, despite being minimum 16-bit. When writing portable code, you must assume integers can never go above their minimum size.

If you want exact control over integer sizes, you can use the standard types in stdint.h, like int32_t, uint64_t, etc. There are also “least” and “fast” variants, like int_least16_t and int_fast16_t.

Should you use these well-specified types everywhere you can? I must admit I’m torn on this question, but the more I think about it, the more I think you should: there are no downsides.² The only reason you really shouldn’t is when making an API which has to interface with very old C89 compilers which lack stdint.h. There’s also an argument for considering what the type communicates to the reader and whether the size is actually important; however by using standard types like int you are still implicitly relying on a certain size. It’s probably no worse, yet clearer, to use int_fast16_t or something over int. (However, typically no one does this, including me!)
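As a sketch of what the stdint.h types look like in use (the function names here are hypothetical):

```c
#include <stdint.h>

/* Combines four bytes, most significant first, into one 32-bit value.
 * The casts matter: without them each byte would be promoted to int
 * before shifting. */
uint32_t pack_be32(const uint8_t b[4]) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* int_fast16_t: at least 16 bits, in whichever width the platform
 * considers fastest. Intended for short inputs only. */
int_fast16_t sum_small(const int8_t *v, int n) {
    int_fast16_t total = 0;
    for (int i = 0; i < n; i++) total += v[i];
    return total;
}
```

The exact-width type says "this must be 32 bits", while the fast type says "16 bits is enough, pick whatever is convenient".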

Arithmetic and promotion

Arithmetic in C is subject to many bizarre rules which can give unexpected or unportable results. Integer promotions are especially important to be aware of.

See Nayuki’s summary of C integer rules.
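Two of the classic surprises, as a small sketch with hypothetical function names:

```c
#include <stdint.h>

/* Both operands are promoted to int before the addition, so the
 * sum is computed as an int and is not limited to 8 bits. */
int sum_promoted(uint8_t a, uint8_t b) {
    return a + b;
}

/* Converting the result back to uint8_t wraps it modulo 256. */
uint8_t sum_wrapped(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* When an int meets an unsigned int, the int is converted to unsigned,
 * so -1 becomes UINT_MAX. The conversion is written out explicitly
 * here; a plain `m < u` performs the same conversion implicitly. */
int minus_one_less_than(unsigned u) {
    int m = -1;
    return (unsigned)m < u;
}
```

So sum_promoted(200, 100) is 300, sum_wrapped(200, 100) is 44, and minus_one_less_than(1u) is 0, i.e. false.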

char signedness

All other integer types default to signed, but bare char can be signed or unsigned, depending on the platform. As such, it’s only portable when used for strings; specify the sign too if you want a small/minimum-8-bit³ number.
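A sketch of how this shows up when reading bytes out of a string. The function names are hypothetical; CHAR_MIN comes from limits.h.

```c
#include <limits.h>

/* Returns the value of a byte in the range 0..UCHAR_MAX, regardless
 * of whether plain char is signed on this platform. */
int byte_value(char c) {
    return (unsigned char)c;
}

/* Reports which choice this platform made for plain char. */
int plain_char_is_signed(void) {
    return CHAR_MIN < 0;
}
```

Without the cast, a byte like 0xE9 would come out negative on platforms where char is signed and positive elsewhere.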

Macros vs const variables

To define simple constant values, you have two choices:

static const int my_constant = 5;
// or
#define MY_CONSTANT 5

The difference is that the former is a real variable and the latter is a copy-pasted inline expression.

Unlike variables, you can use macros in contexts where you need a “constant expression,” like array lengths or switch statement cases.
Unlike macros, you can get a pointer to a variable.

Having constants actually be “constant expressions” is very useful and hence they should usually be defined as macros. Variables are better for larger or more complex values like struct instances.

If your constant is an integer, you have a third, better option, the “bare enum”:

enum {
    MY_CONSTANT = 5
};

This defines a constant expression in C, not in the pre-processor, so it can be more easily seen by debuggers etc.

In C23, you can optionally give an explicit “underlying type” to an enum:

enum : size_t {
    BUFFER_LENGTH = 1024
};
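Putting the three kinds of constant side by side, as a sketch with hypothetical names:

```c
enum { GRID_WIDTH = 4 };            /* bare enum: a true constant expression */
#define GRID_HEIGHT 3               /* macro: pasted in as the literal 3 */
static const int grid_depth = 2;    /* variable: has an address, but is not
                                       a constant expression */

/* Array lengths at file scope must be constant expressions. */
static int grid[GRID_HEIGHT][GRID_WIDTH];

int grid_cell_count(void) {
    int n = 0;
    for (int y = 0; y < GRID_HEIGHT; y++) {
        for (int x = 0; x < GRID_WIDTH; x++) {
            grid[y][x] = 1;
            n += grid[y][x];
        }
    }
    return n;
}

/* case labels must be constant expressions too. */
const char *describe(int n) {
    switch (n) {
    case GRID_WIDTH:  return "width";
    case GRID_HEIGHT: return "height";
    default:          return "other";
    }
}

/* Only the variable can be pointed to. */
const int *depth_ptr(void) {
    return &grid_depth;
}
```

Swapping grid_depth into the array length or a case label would be rejected, which is the practical reason to prefer the enum or the macro.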

Macros vs inline functions

Macros can have parameters, which can then expand to C code.

Advantages over functions:

The code is pasted right in the surrounding code, instead of compiling function call instructions. This can make code faster, as function calls have some overhead.
They can be type-generic. For example, x + y is valid syntax for any numeric type. If we made that a function, we’d have to declare them as arguments and choose their type, i.e. size and signedness, in advance, which would make it only usable in some contexts.

Disadvantages:

Repeated evaluation of arguments. Suppose we have a macro MY_MACRO(x). If x is used multiple times in the definition, then the expression x will be evaluated multiple times, because it is simply copied and pasted.⁴ Compare that with a function, where expressions as arguments are evaluated once to values and then passed into the function.
They can be error-prone because they work at the source level. It is generally a good idea to use brackets gratuitously, always around the whole macro definition itself and any arguments, so expressions don’t merge unintentionally.
```
// Instead of:
#define MY_MACRO(x) x+x
// Do:
#define MY_MACRO(x) ((x)+(x))
```

Unless you need to be type-generic, you can get the best of both worlds by defining a function as static inline. inline provides a hint to compilers that the code in the function should be compiled directly into where it is used, instead of being called. You can put static inline functions in header files, just like macros, with no issues.
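To see the repeated-evaluation problem next to the static inline fix, here is a sketch that counts how often an argument is evaluated. All names in it are hypothetical.

```c
static int calls = 0;

/* Helper with a side effect, so evaluations can be counted. */
static int next_value(void) {
    return ++calls;
}

#define TWICE_MACRO(x) ((x) + (x))

static inline int twice_inline(int x) {
    return x + x;
}

/* The macro pastes its argument twice, so next_value() runs twice. */
int calls_made_by_macro(void) {
    calls = 0;
    (void)TWICE_MACRO(next_value());
    return calls;
}

/* The function evaluates its argument once, then receives the value. */
int calls_made_by_inline(void) {
    calls = 0;
    (void)twice_inline(next_value());
    return calls;
}
```

Here calls_made_by_macro() returns 2 and calls_made_by_inline() returns 1.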

Additionally, since C11 you can provide overloads of functions for different types using the _Generic keyword, typically wrapped in a macro:

#define sin(X) _Generic((X), \
              long double: sinl, \
                  default: sin,  \
                    float: sinf  \
              )(X)
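The same technique works for your own functions. A sketch, assuming a C11 compiler, with hypothetical names and only int and double handled:

```c
static inline int    max_int(int a, int b)          { return a > b ? a : b; }
static inline double max_double(double a, double b) { return a > b ? a : b; }

/* Picks an implementation from the type of (a) + (b), i.e. the type
 * the two arguments would have after the usual arithmetic conversions.
 * The controlling expression is only inspected for its type; it is
 * never evaluated. */
#define max_of(a, b) _Generic((a) + (b), \
        int:    max_int,                 \
        double: max_double               \
    )((a), (b))
```

So max_of(2, 3) calls max_int, while max_of(2, 2.5) calls max_double because int + double has type double. Any other type is a compile error, since there is no default case.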

¹ https://stackoverflow.com/questions/1410563/what-is-the-difference-between-a-definition-and-a-declaration
² https://stackoverflow.com/a/9837399/1561010
³ But not always 8 bit. char is special because it’s the smallest addressable type on the current platform, which is not required to be (but basically always is) 8 bits. The size of char in bits is available in the macro CHAR_BIT from limits.h. All other sizes in C, such as from sizeof, are in units of char.
⁴ If the expression has no side effects and the compiler can figure this out, it might be optimised by common subexpression elimination.