C for scripters: the essential differences

21 July 2019

When first learning to program, many people start with interpreted languages, such as Ruby and Python, because such languages provide a nice, dynamic environment with quick feedback.

When I started looking into C, I found a surprising number of differences between it and the languages I already knew, and I had to learn a lot of new concepts that were simply not relevant in interpreted languages. There was also a lack of the central, high-quality resources and communities that I had become used to with other languages. Nonetheless, C is one of the most popular programming languages in the world, and it sits at the core of all the major modern operating systems, so a good understanding of it is highly useful to a programmer. This article is a summary of the lessons I learned, in the hope that it will be useful to others in my position!

C is “just” a language

When someone implements C, all they need to create is a compiler. Usage of C by programmers might involve all sorts of other tools, like build systems, configuration systems, pkg-config, etc. There is a lot of choice, and these tools don’t really form a coherent ecosystem like the one you may be used to in some more modern interpreted languages. To manage the complexity and learn the basics, I’d recommend only dealing with Make at first.

There is no such thing as the C

Many programming languages have what is called a “reference implementation”: a single, major implementation which defines the language by its behaviour. C is not such a language; it is defined by a written standard and has many different compilers, which may support different revisions of that standard (ANSI C/C89, C99, C11, etc.) and sometimes add their own extensions too. Major implementations are GCC, Clang, and MSVC.

Even after choosing a compiler, the choice doesn’t stop there, as there are hundreds and hundreds of options to change the compilation of your code. For starters, some common ones are (for the command line, with GCC or similar):

-O2 for optimising release builds
-g for enabling extra information for debuggers
-std=... to explicitly choose a standard, e.g. -std=c99
-Wall to enable many warnings (kind of like a linter), and -Werror to turn warnings into errors
-D for defining preprocessor macros
There are many options beginning with -f which tweak the generated code

Including other files works differently (header files)

With an interpreted language, reading the source code and executing it happen at the same time. Including another file is often just a case of switching to read it instead, and then back to the original file.

With a compiled language like C, things are different. Each file of a project is compiled separately to produce an object file, which contains generated code and a table of the function and variable names it exports. Then, all the object files are combined by the linker, which matches up the definitions of functions and variables with their uses in other files.

# compile to objects
cc -c thing.c -o thing.o
cc -c stuff.c -o stuff.o

# link objects to executable
cc thing.o stuff.o -o gizmo

All functions and global variables are exported by default. Declaring with the static keyword makes them local to the file.

When using something from another file, you must “import” it with a special statement – either a function declaration or an extern variable declaration. To simplify things, most people write header files (.h) which contain such declarations, as well as other definitions like typedefs, enums, and macros, which cannot be exported. Then, you can conveniently spread code across files, and just #include the header in each one.

Libraries written by other people are usually distributed as “shared objects” (.so) on Unix-like systems, or as “dynamic-link libraries” (.dll) on Windows. To use them, you must include their header files and then link with them (e.g. using the -l compiler flag).

See Storage-class specifiers on cppreference for more information about visibility and lifetimes of variables.

There is no middle-man between you and the OS

Interpreted languages are, of course, interpreted by a program which executes your code. This interpreter usually hides underlying details and creates a standard environment.

C, though, compiles straight to machine code. So to run your program on another operating system or CPU architecture, you must compile it specifically for that platform. Platform-specific code may also be necessary. Distributing or porting a program can therefore become quite an involved task.

Using memory is different

Almost a follow-on to the previous point: there is no memory safety in C. What this means is that there’s nothing stopping you from reading or writing to memory which you really shouldn’t.

You can compile code which attempts to read from an invalid location, outside the process’ address space. When you run it, your OS will intercept the read and instantly kill the process – a segmentation fault, or segfault.

Though not part of the language itself, there are typically two places memory is stored by operating systems. The first is the stack. Consider the following C program.

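#include <stdio.h>

int *getData(void) {
    int data[] = {1, 4, 9};
    return data; /* deliberate bug: data stops existing when getData returns */
}

int main(void) {
    int *data = getData();
    printf("%d\n", data[1]); /* reads memory that is no longer valid */
    return 0;
}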

You might expect this to retrieve and print the number 4, but its behaviour is actually undefined. This is because the array data in getData, like all ordinary local variables in C, is stored on the stack. When a function returns, all its data on the stack is deallocated. Hence the data pointer in main points to an invalid location, and dereferencing it with the array access might give back garbage data, crash, or even appear to work. Most compilers will warn about returning the address of a local variable. See Storage duration on cppreference for more details.

To have longer-lasting data, you must request memory from the operating system. In C, this is done with the standard functions malloc, calloc, etc. These allocate memory in the second area used for data, commonly called the heap. These functions have return type void *, which signifies an arbitrary address.

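#include <stdio.h>
#include <stdlib.h>

int *getData(void) {
    /* malloc returns void *, which converts to int * without a cast */
    int *data = malloc(3 * sizeof(int));
    if (data == NULL) {
        return NULL; /* allocation failed */
    }
    data[0] = 1;
    data[1] = 4;
    data[2] = 9;
    return data;
}

int main(void) {
    int *data = getData();
    if (data == NULL) {
        return 1;
    }
    printf("%d\n", data[1]);
    free(data); /* the caller is responsible for freeing */
    return 0;
}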

Memory allocated in this way is never automatically deallocated, so it can be passed freely between functions. It must be freed manually with the function free. Accessing the memory after it has been freed is undefined behaviour: the OS may kill your process with a “segmentation fault” error, or the program may silently read or corrupt unrelated data. Also, remember to free the memory before the last pointer to it goes out of scope, or you’ll leak it: the memory stays claimed but you can no longer access it!

Alternatively, you can sometimes avoid the use of heap memory by allocating on the stack in the calling function and passing a pointer to the callee:

#include <stdio.h>

void getData(int *data) {
    data[0] = 1;
    data[1] = 4;
    data[2] = 9;
}

int main(void) {
    int data[3];
    getData(data); /* the array name converts to a pointer to its first element */
    printf("%d\n", data[1]);
    return 0;
}

Conclusion

I hope this article was useful to you somehow. As a final tip, my go-to sites for C reference are the C section of cppreference and, surprisingly, the C tutorial on TutorialsPoint, which isn’t bad either.