12 July 2020


Computers store all data as a sequence of bits.


When we wish to perform numerical computations on a computer, we must decide how to represent our numbers as sequences of bits for the computer to manipulate.

Usually, computers interpret numbers directly as integers. In decimal, the sequence of digits \(528\) represents eight 1’s, two 10’s, and five 100’s. We can do the same for binary numbers: \(1101\) represents one 1, zero 2’s, one 4, and one 8.


However, what if we want to represent numbers with fractional parts, such as \(0.5\) and \(1.25\)?

A simple way to represent fractional numbers, which I will discuss in this article, is to treat them as integers, but with the decimal point moved some number of places. For example, in decimal we could store \(0.5\) as \(050\), and \(1.25\) as \(125\). We then interpret these numbers in our calculations as if there was a decimal point before the first two digits. This method is called fixed-point representation, because the decimal point is in a fixed position!

More generally, what we have done here is decide on a common representation factor that we multiply into our numbers to store them. In this case we chose the factor \(F = 100\). To represent a number \(x = 1.25\), we store \(xF = 1.25 \times 100 = 125\).

Precision and error

Now, let’s consider what happens when we go to represent \(x = 2.468\). We multiply by our factor to obtain \(xF = 2.468 \times 100 = 246.8\). However, we have a problem: we can’t store this number in a computer as it’s not an integer. Therefore, we’ll have to round it, to \(247\).

In this case, there is error in our representation; we tried to represent \(2.468\) but what we got was a representation of \(2.47\). Essentially, we can only represent numbers to a certain precision, which describes the smallest difference we can measure between two numbers. We can represent differences of up to \(1\) in \(xF\) (because it has to be an integer), so our precision for \(x\) is \(1/F\). The error caused by rounding due to precision is called round-off error. We will explore more sources of error later.


Now we need to work out how to add, multiply, and do other computation with fixed-point, so we can actually use it!

Let’s work out how to add \(x\) and \(y\). They are represented as \(xF\) and \(yF\). (We will assume these are exact.) We need to get \((x+y)F\), the representation of \(x+y\). This is easy; we can use normal integer addition:

\[(x+y)F = xF + yF\]

The final step is determining if there’s any error in our computation. In this case, there isn’t, because integer addition is exact. Here’s some code; fixpt is our defined type for fixed-point numbers:

fixpt fp_add(fixpt x, fixpt y) {
    return x + y;


Now, how do we multiply \(x\) and \(y\)? We need to get \(xyF\) from \(xF\) and \(yF\).

\[xyF = \frac{xyF^2}{F} = \frac{xF \times yF}{F}\]

Great! Now, is there any error?

Unfortunately there is here, because integer division is not exact. When we divide two numbers with integer division, we get a result which has been rounded down to the nearest integer; this means we have an error somewhere between \(0\) and \(-1\). In another words, when we integer-divide \(xyF^2\) by \(F\), instead of getting \(xyF\), we actually get

\[xyF^2 // F = xyF + \delta \quad \text{where } -1 < \delta \leq 0,\]

where double-slash represents integer division.

Another way of writing this is by setting \(\epsilon = \delta/F\). Then we can say that our actual result is

\[\begin{align*} xyF^2 // F &= xyF + \delta \\ &= (xy + \epsilon)F \quad \text{where } -1/F < \epsilon \leq 0. \end{align*}\]

In other words, the actual result of our multiplication has an error of at most \(-1/F\).

fixpt fp_mul(fixpt x, fixpt y) {
    return x * y / FACTOR;  // integer division rounds down!

The good news is that we can essentially eliminate this error by using a division routine which rounds to the nearest integer, instead of always rounding down. With this, we have \(-1/2 < \delta \leq 1/2\) after the division, and so our final error \(\epsilon\) is such that \(-1/2F < \epsilon \leq 1/2F\).

fixpt fp_mul(fixpt x, fixpt y) {
    return rounding_divide(x * y, FACTOR);  // more accurate

Why does this “essentially eliminate the error?” It’s still not zero!

Let’s consider: what is the significance of the quantity of the value \(1/2F\)? We established earlier that the precision of a fixed-point scheme with factor \(F\) is \(1/F\). In other words, when we want to represent a number, we have to round it to the nearest multiple of \(1/F\). And the very furthest we can be from such a multiple is half-way between two of them… i.e. at a distance of \(1/2F\)!

What this means is that, if the final result of our computation is within \(1/2F\) of the correct answer, it is no less accurate than the round-off error from just directly inputting the correct answer!

Getting the accuracy within half of the precision is the ultimate goal when designing a numerical computation. Any improvement to that value will be unmeasurable. However, it’s important to realise that there is still error here.

I hope this article gave you a reasonable understanding of these fundamentals of numerical analysis. I might some day write a follow-up about floating point and some more complex computations.

☜ All posts · Subscribe