# Fixed-point and number representations

12 Jul 2020

### Representations

Computers store all data as a sequence of bits.

$01101010$

When we wish to perform numerical computations on a computer, we must decide how to represent our numbers as sequences of bits for the computer to manipulate.

Usually, computers interpret these bit sequences directly as integers. In decimal, the sequence of digits $$528$$ represents eight 1’s, two 10’s, and five 100’s. We can do the same for binary numbers: $$1101$$ represents one 1, zero 2’s, one 4, and one 8, i.e. $$13$$.

### Fractions

However, what if we want to represent numbers with fractional parts, such as $$0.5$$ and $$1.25$$?

A simple way to represent fractional numbers, which I will discuss in this article, is to treat them as integers, but with the decimal point moved some number of places. For example, in decimal we could store $$0.5$$ as $$050$$, and $$1.25$$ as $$125$$. We then interpret these numbers in our calculations as if there were a decimal point before the last two digits. This method is called fixed-point representation, because the decimal point is in a fixed position!

More generally, what we have done here is decide on a common representation factor that we multiply into our numbers to store them. In this case we chose the factor $$F = 100$$. To represent a number $$x = 1.25$$, we store $$xF = 1.25 \times 100 = 125$$.
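
To make this concrete, here is a minimal sketch of how such a representation might be declared in C. The names `fixpt` and `FACTOR` are the ones used in the code later in this article; the conversion helpers `fp_from_double` and `fp_to_double` are hypothetical names introduced just for illustration:

```c
#include <stdint.h>

typedef int32_t fixpt;  // a fixed-point number, storing xF as an integer
#define FACTOR 100      // the representation factor F

// Encode: store xF, rounded to the nearest integer.
fixpt fp_from_double(double x) {
    return (fixpt)(x * FACTOR + (x >= 0 ? 0.5 : -0.5));
}

// Decode: interpret the stored integer as x = (xF) / F.
double fp_to_double(fixpt a) {
    return (double)a / FACTOR;
}
```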

### Precision and error

Now, let’s consider what happens when we go to represent $$x = 2.468$$. We multiply by our factor to obtain $$xF = 2.468 \times 100 = 246.8$$. However, we have a problem: we can’t store this number in a computer as it’s not an integer. Therefore, we’ll have to round it, to $$247$$.

In this case, there is error in our representation; we tried to represent $$2.468$$, but what we got was a representation of $$2.47$$. Essentially, we can only represent numbers to a certain precision, the smallest difference we can distinguish between two numbers. The smallest difference we can represent in $$xF$$ is $$1$$ (because it has to be an integer), so our precision for $$x$$ is $$1/F$$. The error caused by rounding due to precision is called round-off error. We will explore more sources of error later.
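
Using the hypothetical conversion helpers sketched above, we can watch this round-off error happen:

```c
#include <stdio.h>

int main(void) {
    fixpt r = fp_from_double(2.468);  // 2.468 * 100 = 246.8, rounds to 247
    printf("stored: %d\n", r);        // stored: 247
    printf("decoded: %g\n", fp_to_double(r));  // decoded: 2.47, off by 0.002
    return 0;
}
```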

Now we need to work out how to add, multiply, and perform other computations with fixed-point numbers, so we can actually use them!

Let’s work out how to add $$x$$ and $$y$$. They are represented as $$xF$$ and $$yF$$. (We will assume these are exact.) We need to get $$(x+y)F$$, the representation of $$x+y$$. This is easy; we can use normal integer addition:

$(x+y)F = xF + yF$

The final step is determining if there’s any error in our computation. In this case, there isn’t, because integer addition is exact. Here’s some code; `fixpt` is our defined type for fixed-point numbers:

```c
fixpt fp_add(fixpt x, fixpt y) {
    // (x + y)F = xF + yF, so ordinary integer addition is exact.
    return x + y;
}
```
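
For example, adding $$0.5$$ and $$1.25$$ (again with the hypothetical helpers from earlier):

```c
fixpt sum = fp_add(fp_from_double(0.5), fp_from_double(1.25));
// sum == 50 + 125 == 175, which represents 1.75 exactly
```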


### Multiplication

Now, how do we multiply $$x$$ and $$y$$? We need to get $$xyF$$ from $$xF$$ and $$yF$$.

$xyF = \frac{xyF^2}{F} = \frac{xF \times yF}{F}$

Great! Now, is there any error?

Unfortunately there is here, because integer division is not exact. When we divide two numbers with integer division, we get a result that has been rounded down to an integer; this means we have an error somewhere between $$0$$ and $$-1$$. In other words, when we integer-divide $$xyF^2$$ by $$F$$, instead of getting $$xyF$$, we actually get

$xyF^2 // F = xyF + \delta \quad \text{where } -1 < \delta \leq 0.$

Here, the double-slash denotes integer division.

Another way of writing this is by setting $$\epsilon = \delta/F$$. Then we can say that our actual result is

$\begin{aligned} xyF^2 // F &= xyF + \delta \\ &= (xy + \epsilon)F \quad \text{where } -1/F < \epsilon \leq 0. \end{aligned}$

In other words, the actual result of our multiplication has an error of magnitude less than $$1/F$$, always on the low side.
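
For a concrete example (numbers chosen purely for illustration): with $$F = 100$$, take $$x = y = 0.15$$, so $$xF = yF = 15$$. Then $$xF \times yF = 225$$, and $$225 // 100 = 2$$, which represents $$0.02$$. The true product is $$0.0225$$, so $$\epsilon = -0.0025$$, within the bound $$-1/F < \epsilon \leq 0$$.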

```c
fixpt fp_mul(fixpt x, fixpt y) {
    // Widen first: the intermediate product xF * yF can overflow fixpt.
    // Note: C's integer division truncates toward zero, which matches
    // "rounds down" only for non-negative products.
    return (fixpt)((int64_t)x * y / FACTOR);
}
```


The good news is that we can essentially eliminate this error by using a division routine which rounds to the nearest integer, instead of always rounding down. With this, we have $$-1/2 < \delta \leq 1/2$$ after the division, and so our final error $$\epsilon$$ is such that $$-1/(2F) < \epsilon \leq 1/(2F)$$.

```c
fixpt fp_mul(fixpt x, fixpt y) {
    // Round to the nearest integer instead of always rounding down.
    return rounding_divide((int64_t)x * y, FACTOR);  // more accurate
}
```
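
Here is one possible `rounding_divide`, a minimal sketch assuming a positive divisor, which rounds halves away from zero:

```c
// One possible rounding_divide: nearest integer, halves away from zero.
// Assumes b > 0; the wide parameters match the widened product above.
fixpt rounding_divide(int64_t a, int64_t b) {
    if (a >= 0)
        return (fixpt)((a + b / 2) / b);
    else
        return (fixpt)((a - b / 2) / b);
}
```

To see the improvement concretely: multiplying $$0.05$$ by $$0.55$$ (true product $$0.0275$$), the round-down version computes $$275 // 100 = 2$$, representing $$0.02$$ with error $$-0.0075$$, while the rounding version gives $$3$$, representing $$0.03$$ with error $$+0.0025$$, within $$1/(2F)$$.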


Why does this “essentially eliminate” the error? It’s still not zero!

Let’s consider: what is the significance of the value $$1/(2F)$$? We established earlier that the precision of a fixed-point scheme with factor $$F$$ is $$1/F$$. In other words, when we want to represent a number, we have to round it to the nearest multiple of $$1/F$$. And the very furthest we can be from such a multiple is half-way between two of them… i.e. at a distance of $$1/(2F)$$!

What this means is that, if the final result of our computation is within $$1/(2F)$$ of the correct answer, it is no less accurate than what we would have got by rounding the correct answer directly into our representation!

Getting the accuracy to within half of the precision is the ultimate goal when designing a numerical computation; any further improvement is unmeasurable in the final representation. However, it’s important to realise that there is still error here.

I hope this article gave you a reasonable understanding of these fundamentals of numerical analysis. I might some day write a follow-up about floating point and some more complex computations.
