I am not going to compete with numerous books on numerical analysis. There are also a lot of good tutorials on the Internet (see e.g. the links below).

This article gives some basics of the floating-point arithmetic supported by practical examples, recipes and tricks. This work was inspired by the excellent book by George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler. Some of my own examples were replaced with similar examples from this book.

If we have the calculus, why do we need in addition numerical methods for computer calculations? A common answer is "computer calculations are inaccurate". This is not quite correct. These are two completely different types of calculations.

The calculus and higher analysis operate with the infinite set of objects called real numbers. The arithmetic operations with real numbers are governed by the following axioms:

**Closure Axiom.**For real numbers*a*and*b*, (*a***op***b*) is a unique real number (**op**is any arithmetic operation: +, -, *, /; b ≠ 0 for / ).**Commutative Axiom.**For real numbers*a*and*b*,*a***op***b*=*b***op***a*, (**op**= *, + ).**Associative Axiom.**For real numbers*a*,*b*and*c*, (*a***op***b*)**op***c*=*a***op**(*b***op***c*).**Identity Axiom of Addition.**For any real number*a*,*a*+ 0 = 0 +*a*=*a.***Identity Axiom of Multiplication.**For any real number*a*,*a**1 = 1**a*=*a.***Additive Inverse Axiom.**For any real number*a*, there exists a unique real number -*a*such that*a*+ (-*a*) = -*a*+*a*= 0. The number -*a*is known as the additive inverse (negative) of*a*.**Multiplicative Inverse Axiom.**For any nonzero real number*a*, there exists a unique real number (1/*a*) such that*a**(1/*a*) = (1/*a*)**a*= 1.**Distributive Axiom.**For any real numbers*a*,*b*and*c*, (*a***op**_{1}*b*)**op**_{2}*c*=*a***op**_{2}*c***op**_{1}*b***op**_{2}*c*(**op**_{1}= +, - ,**op**_{2}= *, / ).

Computers operate with floating-point numbers. Each floating-point number *x* has the value

(1) |

Usually the numbers are normalized, i.e. *d*_{1} > 0.
The operations with such numbers **violate** almost all of the above axioms.
This is not at all the arithmetic we know. Let us illustrate this with a simplified example.

The set of numbers (1) with the base *β* = 2, the precision *t* =
2 and the exponent range *L* = -1, *U* = 2 can be presented in the following graphical form:

The set of floating-point numbers

You can visually check the correctness of the following general statements.

First, this set is not a continuum, or even an infinite set. The number are not equidistant.

Obviously the range of numbers is limited (to [-3,3] in our example).

For any positive **real** number always exists a smaller positive number. This property of real
numbers is of a fundamental significance in the higher analysis. This is not
true for ** floating-point** number: there is always a finite gap between zero
and the smallest non-zero number (±1/4 in our example).

The result of the arithmetic operations does not necessarily belong to the set of floating-point numbers (e.g. 2 + 1/4, 1/3 ). The necessity to map the result to some floating-point numbers causes so-called round-off errors. However, it is impossible to map to any number if the result is outside the limits of the set. This is called the overflow error.

The order of operations matters. For example (2 + 3/2) - 1 causes the overflow, but (2 -1) + 3/2 gives only a round-off error.

The numbers less than the smallest non-zero values a rounded to zero. This can cause a catastrophic loss of accuracy. This effect is called the underflow.

The round-off errors are machine-dependent. In more general way,
the accuracy of floating-point arithmetic can be characterized by *machine
epsilon*, the smallest number ε_{ 0} such that

1 + ε_{ 0} > 1 |
(2) |

Many numerical algorithms use the value of machine epsilon for the optimization of accuracy.

In the calculus and higher analysis a solution is often presented as a result of some infinite process (series, succession, limit). Infinite processes cannot be implemented on computes because of a finite speed of calculations and accumulation of the round-off errors. Therefore, infinitely small objects have to be replaced with finite elements and/or processes must be terminated at some point. All this is the source of errors as well.

As we see, there are various sources of errors. Once errors are generated, they propagate through calculations. Some algorithms can amplify the errors which causes numerical instability.

All numerical experiments are performed on PC with Intel processor using single and double precision.

Type | Range | Significant digits | Size in bytes |

Single | 1.5 10^{-45} .. 3.4 10^{38} |
7 - 8 | 4 |

Double | 5.0 10^{-324} .. 1.7 10^{308} |
15 - 16 | 8 |

Test projects for Borland's C++Builder 6 and Delphi 7 are available for **download**.

The solution by summation is

Pascal | C++ |
---|---|

var i: integer; sum: double; sum = 0.0; for i := 1 to 100 do begin sum := sum + 0.1; if sum = 1.0 then break; end; writeln(sum); |
double sum = 0.0; for(int i=1; i<=100; i++) { sum += 0.1; if (sum == 1.0) { break; } } cout << sum << endl; |

The result is sum = 9.99999999999998. On the other hand, (10.0*(0.1) = 1.0 ) is true. However you should never rely on the equality of floating-point numbers.

The obvious solution is

Pascal | C++ |
---|---|

eps := 1.0; while (1.0 + eps) > 1.0 do begin writeln(eps); eps := eps/2.0; end; |
eps = 1.0; while ((1.0+eps)> 1.0) { cout << eps << endl; eps /= 2.0; } |

The last printed value must be the machine epsilon. However, for Intel processor regardless of the precision of eps it gives 1.08420217248550E-0019 , see the console projects macheps. This is because this small piece of code was optimized and the internal processor precision was used. The following code

Pascal | C++ |
---|---|

eps := 1.0; repeat writeln(eps); eps := eps/2.0; sum := 1.0 + eps; until sum <= 1.0 ; |
eps = 1.0; do { cout << eps << endl; eps /= 2.0; sum = 1.0 + eps; } while (sum > 1.0); |

gives ε_{ 0} = 1.19209289550781E-7 for single float and ε_{ 0} = 2.22044604925031E-16 for double.

It is proven that the series

(3) |

diverges (tends to infinity as ln n). Let us check this with a computer. The summation is terminates when the sum stops changing:

Pascal | C++ |
---|---|

sum := 0.0; n := 1.0; repeat sum1 := sum; sum := sum + 1.0/n; n := n + 1.0; until sum1 = sum; |
sum = 1.0; n = 1.0; do { sum1 = sum; sum = sum + 1.0/n; n += 1.0; } while (sum > sum1); |

For the single/float numbers the summation stopped at n = 2097153 and S = 15.4036827087402.
The result is much less than the infinity because 1/n < sum*ε_{ 0}
. How long will it take to run with the double precision?

It is proven that the series

(4) |

converges for any finite *x*. In this experiment the summation was
terminated when a term became less than the threshold.

Precision | x |
Threshold ε | Sum | exp(x) | Valid digits |

Single | -9.5 | 1.0e-10 | 2.13608618651051E-5 | 7.48518298877006E-5 | 0 |

Double | -9.5 | 1.0e-15 | 7.48518299667056E-5 | 7.48518298877006E-5 | 9 |

Double | -19.5 | 1.0e-15 | 5.54447786514606E-9 | 3.39826781949507E-9 | 0 |

We can see the single precision result for x = -9.5 has no correct digits. Switching to the double precision only moves the problem to larger values.

For this problem there is much better cure. Let us use the identity

exp(-x) = 1/exp(x) |
(5) |

The results have all possible correct digits:

Precision | x |
Threshold ε | 1/Sum | exp(-x) | Valid digits |

Single | 9.5 | 1.0e-10 | 7.48518368560357E-5 | 7.48518298877006E-5 | 7 |

Double | 19.5 | 1.0E-15 | 3.39826781949507E-9 | 3.39826781949507E-9 | 15 |

This trick is discussed in the section "Error propagation".

- Computer Methods for Mathematical Computations, by George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler, Prentice Hall, Englewood Cliffs, New Jersey, 1977.
- Floating point.
- The Floating-Point Guide - What Every Programmer Should Know Floating-Point Arithmetic. - html
- The Floating-Point Guide - What Every Programmer Should Know Floating-Point Arithmetic. - pdf
- IEEE 754-1985.
- Floating point number representation.
- Five Tips for Floating Point Programming.

© Nikolai Shokhirev, 2001 - 2017

Count: