Home

Articles
Tutorials

Floating-point arithmetic

Nikolai Shokhirev

Introduction

I am not going to compete with numerous books on numerical analysis. There are also a lot of good tutorials on the Internet (see e.g. the links below).

This article gives some basics of the floating-point arithmetic supported by practical examples, recipes and tricks. This work was inspired by the excellent book by George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler. Some of my own examples were replaced with similar examples from this book.

Floating-point computations

If we have the calculus, why do we need in addition numerical methods for computer calculations? A common answer is "computer calculations are inaccurate". This is not quite correct. These are two completely different types of calculations.

The calculus and higher analysis operate with the infinite set of objects called real numbers. The arithmetic operations with real numbers are governed by the following axioms:

  1. Closure Axiom. For real numbers a and b, (a op b) is a unique real number (op is any arithmetic operation: +, -, *, /; b ≠ 0 for / ).
  2. Commutative Axiom. For real numbers a and b, a op b = b op a , (op = *, + ).
  3. Associative Axiom. For real numbers a, b and c, ( a op b ) op c = a op ( b op c ).
  4. Identity Axiom of Addition. For any real number a, a + 0 = 0 + a = a.
  5. Identity Axiom of Multiplication. For any real number a, a*1 = 1*a = a.
  6. Additive Inverse Axiom. For any real number a, there exists a unique real number -a such that a + (-a) = -a + a = 0. The number -a is known as the additive inverse (negative) of a.
  7. Multiplicative Inverse Axiom. For any nonzero real number a, there exists a unique real number (1/a) such that a*(1/a) = (1/a)*a = 1.
  8. Distributive Axiom. For any real numbers a, b and c, ( a op1 b) op2 c = a op2 c op1 b op2 c ( op1 = +, - , op2 = *, / ).

Computers operate with floating-point numbers. Each floating-point number x has the value

          (1)

Usually the numbers are normalized, i.e. d1 > 0. The operations with such numbers violate almost all of the above axioms. This is not at all the arithmetic we know. Let us illustrate this with a simplified example.

Example of a floating-point set 

The set of numbers (1) with the base β = 2, the precision t = 2 and the exponent range L = -1, U = 2 can be presented in the following graphical form:

 

The set of floating-point numbers

You can visually check the correctness of the following general statements.

First, this set is not a continuum, or even an infinite set. The number are not equidistant.

Obviously the range of numbers is limited (to [-3,3] in our example).

For any positive real  number always exists a smaller positive number. This property of real numbers is of a fundamental significance in the higher analysis. This is not true for floating-point number:  there is always a finite gap between zero and the smallest non-zero number (±1/4 in our example).

Overflow and round-off errors

The result of the arithmetic operations does not necessarily belong to the set of floating-point numbers (e.g. 2 + 1/4, 1/3 ). The necessity to map the result to some floating-point numbers causes so-called round-off errors. However, it is impossible to map to any number if the result is outside the limits of the set. This is called the overflow error.

The order of operations matters. For example (2 + 3/2) - 1 causes the overflow, but (2 -1) + 3/2 gives only a round-off error.

The numbers less than the smallest non-zero values a rounded to zero. This can cause a catastrophic loss of accuracy. This effect is called the underflow.

Machine epsilon

The round-off errors are machine-dependent. In more general way, the accuracy of floating-point arithmetic can be characterized by machine epsilon, the smallest number ε 0 such that

1 + ε 0 > 1        (2)

 Many numerical algorithms use the value of machine epsilon for the optimization of accuracy.

Discretization and truncation errors

In the calculus and higher analysis a solution is often presented as a result of some infinite process (series, succession, limit). Infinite processes cannot be implemented on computes because of a finite speed of calculations and accumulation of the round-off errors. Therefore, infinitely small objects have to be replaced with finite elements and/or processes must be terminated at some point. All this is the source of errors as well.

 Numerical instability

As we see, there are various sources of errors. Once errors are generated, they propagate through calculations. Some algorithms can amplify the errors which causes numerical instability.

Numerical experiments

All numerical experiments are performed on PC with Intel processor using single and double precision.

 Type  Range   Significant digits   Size in bytes 
 Single  1.5 10-45 .. 3.4 1038    7 - 8  4
 Double   5.0 10-324 .. 1.7 10308   15 - 16  8

Test projects for Borland's C++Builder 6 and Delphi 7 are available for download.

Is 10 times 1/10 equal to 1?

The solution by summation is

  Pascal    C++  
  var i: integer;
      sum: double;

  sum = 0.0;
  for i := 1 to 100 do
  begin
    sum := sum + 0.1;
    if sum = 1.0 then
      break;
  end;
  writeln(sum);
  
  double sum = 0.0;

  for(int i=1; i<=100; i++)
  {
    sum += 0.1;
    if (sum == 1.0)
    {
      break;
    }
  }
  cout << sum << endl;

The result is sum = 9.99999999999998. On the other hand, (10.0*(0.1) = 1.0 ) is true. However you should never rely on the equality of floating-point numbers.

Calculation of machine epsilon

The obvious solution is

  Pascal    C++  
  eps := 1.0;
  while (1.0 + eps) > 1.0 do
  begin
    writeln(eps);
    eps := eps/2.0;
  end;  
  eps = 1.0;
  while ((1.0+eps)> 1.0)
  {
    cout << eps << endl;
    eps /= 2.0;
  } 

The last printed value must be the machine epsilon. However, for Intel processor regardless of the precision of eps it gives 1.08420217248550E-0019 , see the console projects macheps. This is because this small piece of code was optimized and the internal processor precision was used. The following code

  Pascal    C++  
  eps := 1.0;
  repeat
    writeln(eps);
    eps := eps/2.0;
    sum := 1.0 + eps;
  until sum <= 1.0 ;
  eps = 1.0;
  do {
    cout << eps << endl;
    eps /= 2.0;
    sum = 1.0 + eps;
  } while (sum > 1.0);

gives ε 0 = 1.19209289550781E-7 for single float and ε 0 = 2.22044604925031E-16 for double.

Round-off error example 1.

It is proven that the series

          (3)

diverges (tends to infinity as ln n). Let us check this with a computer. The summation is terminates when the sum stops changing:

  Pascal    C++  
  sum := 0.0;
  n   := 1.0;
  repeat
    sum1 := sum;
    sum := sum + 1.0/n;
    n := n + 1.0;
  until sum1 = sum;
  sum = 1.0;
  n = 1.0;
  do {
    sum1 = sum;
    sum = sum + 1.0/n;
    n += 1.0;
  } while (sum > sum1);

For the single/float numbers the summation stopped at n = 2097153 and S = 15.4036827087402. The result is much less than the infinity because 1/n < sum*ε 0 . How long will it take to run with the double precision?

Round-off error example 2.

It is proven that the series

          (4)

converges for any finite x. In this experiment the summation was terminated when a term became less than the threshold.

 Precision   x   Threshold ε   Sum   exp(x)   Valid digits 
 Single    -9.5   1.0e-10  2.13608618651051E-5   7.48518298877006E-5   0
 Double   -9.5   1.0e-15  7.48518299667056E-5   7.48518298877006E-5   9
 Double   -19.5   1.0e-15  5.54447786514606E-9   3.39826781949507E-9   0

We can see the single precision result for x = -9.5 has no correct digits. Switching to the double precision only moves the problem to larger values.

For this problem there is much better cure. Let us use the identity

 exp(-x) = 1/exp(x        (5)

The results have all possible correct digits:

 Precision   x   Threshold ε   1/Sum   exp(-x)   Valid digits 
 Single   9.5   1.0e-10  7.48518368560357E-5  7.48518298877006E-5  7
 Double  19.5  1.0E-15  3.39826781949507E-9   3.39826781949507E-9  15

This trick is discussed in the section "Error propagation".

 

References

  1. Computer Methods for Mathematical Computations, by George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler, Prentice Hall, Englewood Cliffs, New Jersey, 1977.
  2. Floating point.
  3. The Floating-Point Guide - What Every Programmer Should Know Floating-Point Arithmetic. - html
  4. The Floating-Point Guide - What Every Programmer Should Know Floating-Point Arithmetic. - pdf
  5. IEEE 754-1985.
  6. Floating point number representation.
  7. Five Tips for Floating Point Programming.

 

© Nikolai Shokhirev, 2001 - 2025

Count: