Fixed-Point Arithmetic and Transcendental Functions

Not all microprocessors are created equal; some have math coprocessors and some do not. In fact, the Intel 80486 was the first Intel microprocessor with a built-in math coprocessor; up to and including the 80386, you had to pay extra for any decent floating-point processing power. C compilers were supplied with math emulation routines, but the performance was anything but acceptable. It was not because the routines were written poorly; there was a legacy and a standard for floating-point arithmetic to uphold.

They all had to conform to the IEEE standard for double-precision arithmetic on 64-bit words, especially for the solution of transcendental functions. The $$\sin$$, $$\cos$$, $$\tan$$, $$\exp$$, $$\ln$$, and so forth were all written to a high degree of accuracy, but a penalty was paid in terms of performance. Graphics routines were the hardest hit. The transformation and rotation of objects, which require matrix multiplications and additions, were noticeably slow, mainly because the underlying arithmetic operations were slow.

Today, we take math coprocessors for granted, but there are still times when your embedded system design cannot afford a coprocessor and you have to come up with a fast and efficient math library. This chapter is devoted to math algorithms and the solution of transcendental functions. Fixed-point arithmetic is presented as an alternative to floating-point arithmetic, and the pros and cons are discussed with examples. One more point to highlight is the magnitude of the work involved in solving transcendental functions, so that you can make an informed decision regarding your need for a math coprocessor.

We begin our discussion with the solution of transcendental functions.

Transcendental Functions

By definition, the trigonometric and logarithmic functions, including sin, cos, tan, exp, ln, and so forth, cannot be expressed in algebraic form. Just like the square root of 2 and the value of pi, their values are generally not rational numbers, which means that there are no two integers whose ratio can be used to define them. As we know, all arithmetic operations must be brought down to the four basic operations of addition, subtraction, multiplication, and division. If a function cannot be expressed as such operations on rational numbers, it cannot be solved exactly. Thus, the only method of computing the transcendental functions is by expanding the function in its polynomial form.

The Taylor series expansion is the fundamental solution to all transcendental functions.

The following is an example of the Taylor series expansion for the exponent function:

$$ e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots + \frac{x^n}{n!} + \cdots $$

One must expand the series to its infinite terms if an exact solution of the function is required. Obviously, nobody has the time and, probably, nobody needs the precision either. So, how do you decide how many terms of the series should be used? That depends on your application and the degree of precision you need.
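As a point of reference, the following is a minimal sketch in plain C (not part of the library developed later in this chapter) that sums the Taylor series for $$e^x$$ until the next term falls below a chosen tolerance, and reports how many terms it took. The function name taylor_exp and the tolerance value are illustrative only; running it for a few inputs shows directly how the term count grows with the size of the input and the precision you ask for.

#include <stdio.h>

/* Sum the Taylor series of e^x term by term until the next term
   drops below the given tolerance; report how many terms it took. */
static double taylor_exp(double x, double tolerance, int *terms_used)
{
    double term = 1.0;      /* x^0 / 0! */
    double sum  = 1.0;
    int    n    = 0;

    while (term > tolerance || term < -tolerance)
    {
        n++;
        term = term * x / n;    /* x^n / n!  from the previous term */
        sum += term;
    }
    *terms_used = n;
    return sum;
}

int main(void)
{
    int n;
    double y = taylor_exp(5.0, 1.0e-4, &n);
    printf("e^5 is approximately %f, using %d terms\n", y, n);
    return 0;
}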

Fig 1.1 Floating Point Format Double Precision


Figure 1.1 describes the format for a double-precision floating-point number. The float variable, on the other hand, is a 32-bit word with 1 sign bit, 8 exponent bits, and 23 stored mantissa (fraction) bits, which give 24 bits of precision once the implied normalized bit is counted.

Figure 1.2 describes the format for a single-precision number. Again, we gain an extra bit of storage by removing the normalized bit of mantissa and using it to store an extra precision bit.

You might think that specifying a variable as “float” instead of “double” would give you a performance improvement, as there are only 32 bits in a float variable. Think again! It is true that multiplying two 32-bit numbers is faster than multiplying two 64-bit numbers, but all numbers are converted to 64-bit double precision before the polynomial operation is performed. This is double jeopardy: now there is the extra penalty of converting the numbers from single precision to double precision, and vice versa.

If your embedded system cannot afford a math coprocessor and you need floating-point arithmetic, the following is an alternate solution to the standard C library functions. The precision is not that great—in some cases, it is about four decimal places—but you will find these routines several times faster than the standard functions, and you might not need all that extra precision after all.

A two-pronged approach is used in the upcoming implementation. First, lower-degree polynomials are selected compared to those used by the standard double-precision library, which reduces the number of multiplication operations. Second, all the basic arithmetic operations of multiply, divide, add, and subtract are done using 32-bit integer arithmetic.

We will begin our discussion with the general solution of transcendental functions; in other words, polynomial expansion.

Fig 1.2 Floating Point Format Single Precision


Reduced Coefficient Polynomials

A polynomial is a truncated form of an infinite series, suitable for computational purposes. We must stop evaluating terms after achieving a certain degree of precision, as we cannot go on computing forever, as an infinite series would require. If we were to compute the value of the exp function with the help of the Taylor series expansion explained previously, we would need to evaluate terms up to roughly the 100th degree before getting a result with three decimal places of accuracy at the larger input values; that is not a very practical suggestion.

What we need is to combine the coefficients beforehand in order to reduce the degree of the terms in our final analysis. B. Carlson and M. Goldstein, in 1943, presented polynomial solutions for transcendental functions that are suitable for digital computer applications (see the Handbook of Mathematical Functions, Dover Publications, p. 76).

There are generally two sets of solutions for each basic function of sin, cos, tan, arctan, exponent, and logarithm. The high-order polynomial is for the double-precision result, and the low-order polynomial is for the single-precision result. Other functions such as cosec, cot, and sec can be evaluated with the basic trigonometric identities. The following are the two sets of reduced coefficient polynomials, obtained from the respective infinite series, that are guaranteed to be convergent for the input range specified.

sin(x)

The input to this function is the angle $$x$$ in radians, whose sine is to be determined. The governing polynomial is valid only for the range $$0 \le x \le \pi/2$$; thus, the operand must be reduced to the given range by applying the modulus rule for the sine function.

Low-degree polynomial for single-precision application:

$$ \text{Input range: } 0\le x\le \pi/2 \\ \frac{\sin(x)}{x} = 1 + a_2x^2 + a_4x^4 + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 2\times 10^{-4} \\ \text{Constants: } a_2 = -0.16605,\quad a_4 = 0.00761 $$

High-degree polynomial for double-precision application:

$$ \text{Input range: } 0\le x\le \pi/2 \\ \frac{\sin(x)}{x} = 1 + a_2x^2 + a_4x^4 + a_6x^6 + a_8x^8 + a_{10}x^{10} + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 2\times 10^{-9} \\ \text{Constants: } a_2 = -0.1666666664,\quad a_4 = 0.0083333315, \\ a_6 = -0.0001984090,\quad a_8 = 0.0000027526, \\ a_{10} = -0.0000000239 $$
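To make the structure of these approximations concrete, here is a minimal sketch in plain C that evaluates the low-degree sin(x)/x polynomial above with ordinary float arithmetic; the fixed-point version actually used by the library appears later in Code Listing 1.1. The function name approx_sin is illustrative, and the sketch assumes the argument has already been reduced to the range 0 to pi/2.

#include <math.h>
#include <stdio.h>

/* Low-degree polynomial for sin(x), valid only for 0 <= x <= pi/2:
   sin(x)/x = 1 + a2*x^2 + a4*x^4,   |error| <= 2e-4               */
static float approx_sin(float x)
{
    const float a2 = -0.16605f;
    const float a4 =  0.00761f;
    float x2 = x * x;
    return x * (1.0f + a2 * x2 + a4 * x2 * x2);
}

int main(void)
{
    float x = 0.5f;
    printf("approx_sin: %f   library sin: %f\n", approx_sin(x), sin(x));
    return 0;
}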

cos(x)

The input to this function is the angle $$x$$ in radians, whose cosine is to be determined. The governing polynomial is valid only for the range $$0 \le x \le \pi/2$$; thus, the operand must be reduced to the given range by applying the modulus rule for the cosine function.

Low-degree polynomial for single-precision application:

$$ \text{Input range: } 0\le x\le \pi/2 \\ \cos(x) = 1 + a_2x^2 + a_4x^4 + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 9\times 10^{-4} \\ \text{Constants: } a_2 = -0.49670,\quad a_4 = 0.03705 $$

High-degree polynomial for double-precision application:

$$ \text{Input range: } 0\le x\le \pi/2 \\ \cos(x) = 1 + a_2x^2 + a_4x^4 + a_6x^6 + a_8x^8 + a_{10}x^{10} + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 2\times 10^{-9} \\ \text{Constants: } a_2 = -0.4999999963,\quad a_4 = 0.0416666418, \\ a_6 = -0.0013888397,\quad a_8 = 0.0000247609, \\ a_{10} = -0.0000002605 $$

tan(x)

The governing polynomial for $$tan(x)$$ is defined in terms of its counterpart $$x\times cot(x)$$; $$tan(x)$$ is then obtained through the identity $$tan(x) = 1/cot(x)$$. Again, the angle $$x$$ is in radians, whose tangent is to be determined. The polynomial is valid only for the range $$0 \le x \le \pi/4$$; thus, the operand must be reduced to the given range by applying the modulus rule for the cotangent function.

Low-degree polynomial for single-precision application:

$$ \text{Input range: } 0 \le x \le \pi/4 \\ x\cot(x) = 1 + a_2x^2 + a_4x^4 + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 3\times 10^{-5} \\ \text{Constants: } a_2 = -0.332867,\quad a_4 = -0.024369 $$

High-degree polynomial for double-precision application:

$$ \text{Input range: } 0 \le x \le \pi/4 \\ x\cot(x) = 1 + a_2x^2 + a_4x^4 + a_6x^6 + a_8x^8 + a_{10}x^{10} + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 2\times 10^{-6} \\ \text{Constants: } a_2 = -0.3333333410,\quad a_4 = -0.0222220287, \\ a_6 = -0.0021177168,\quad a_8 = -0.0002078504, \\ a_{10} = -0.0000262619 $$

$$e^x$$ (exponent of x)

The polynomial for $$e^x$$ is defined for values less than 0.693 (that is, $$\ln 2$$). The exponent of larger numbers can be computed by splitting the input floating-point value into its integer and fraction portions and applying the following rule:

$$ e^x = e^{(\text{integer portion} \,+\, 0.693 \,+\, \text{fraction portion})} = e^{\text{integer portion}} \times 2 \times e^{\text{fraction portion}} $$

The exponent of the integer portion can quickly grow out of bounds. For example, the exponent of 31 is about $$2.9\times 10^{13}$$; anything above that can be considered infinity.
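As a quick illustration of the splitting rule (the value 2.9 is arbitrary), the integer portion is 2 and the remaining 0.9 is split into 0.693 plus 0.207:

$$ e^{2.9} = e^{(2 \,+\, 0.693 \,+\, 0.207)} = e^{2} \times 2 \times e^{0.207} \approx 7.389 \times 2 \times 1.230 \approx 18.17 $$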

Low-degree polynomial for single-precision application:

$$ \text{Input range: } 0 \le x \le 0.693 \\ e^{x} = \frac{1}{1.0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 + \epsilon} \\ \text{Error magnitude: } |\epsilon|\le 3\times 10^{-5} \\ \text{Constants: } a_1 = -0.9998684,\quad a_2 = 0.4982926, \\ a_3 = -0.1595332,\quad a_4 = 0.0293641 $$

High-degree polynomial for double-precision application:

$$ \text{Input range: } 0 \le x \le 0.693 \\ e^{x} = \frac{1}{1.0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + a_6x^6 + a_7x^7 + \epsilon} \\ \text{Error magnitude: } |\epsilon|\le 2\times 10^{-10} \\ \text{Constants: } a_1 = -0.9999999995,\quad a_2 = 0.4999999206, \\ a_3 = -0.1666653019,\quad a_4 = 0.0416573475, \\ a_5 = -0.0083013598,\quad a_6 = 0.0013298820, \\ a_7 = -0.0001413161 $$

$$ln(x)$$ (Natural Logarithm)

The polynomial for the natural log is defined for values less than 1. The log of a number greater than 1.0 can be computed by applying the rule $$ln(x^n) = n\,ln(x)$$; in particular, splitting the value into its binary exponent and fraction, $$x = 2^{n}(1+f)$$, gives $$ln(x) = n\,ln(2) + ln(1+f)$$, which is the form used by the implementation in Code Listing 1.1.
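For example (the value 6 is arbitrary), since $$6 = 2^2 \times 1.5$$:

$$ \ln(6) = \ln(2^2 \times 1.5) = 2\ln 2 + \ln(1.5) \approx 2(0.6931) + 0.4055 \approx 1.7918 $$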

Low-degree polynomial for single-precision application:

$$ \text{Input range: } 0 \le x \le 1.0 \\ \ln(1+x) = a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 1\times 10^{-5} \\ \text{Constants: } a_1 = 0.99949556,\quad a_2 = -0.49190896, \\ a_3 = 0.28947478,\quad a_4 = -0.13606275,\quad a_5 = 0.03215845 $$

High-degree polynomial for double-precision application:

$$ \text{Input range: } 0 \le x \le 1.0 \\ \ln(1+x) = a_1x + a_2x^2 + a_3x^3 + a_4x^4 + a_5x^5 + a_6x^6 + a_7x^7 + a_8x^8 + \epsilon \\ \text{Error magnitude: } |\epsilon|\le 3\times 10^{-6} \\ \text{Constants: } a_1 = 0.9999964239,\quad a_2 = -0.4998741238, \\ a_3 = 0.3317990258,\quad a_4 = -0.2407338084, \\ a_5 = 0.1676540711,\quad a_6 = -0.0953293897, \\ a_7 = 0.0360884937,\quad a_8 = -0.0064535442 $$

The preceding two sets of polynomials offer different precision for the function they represent. The high-degree polynomial offers higher precision but requires a greater number of computations; just imagine computing x to the power 8, 7, 6, 5, 4, 3, 2, and so on. The low-degree polynomial requires computing at most x to the power of 5, which makes a substantial difference in computational speed. The library functions developed in this chapter were designed around the low-degree polynomials.

Now, let us discuss the inner workings of the computations. It is not enough to reduce the degree of the polynomial, as the bulk of the CPU time is actually spent performing the four basic arithmetic operations of multiply, divide, add, and subtract. If we were to use the default library functions for the basic operations, we would essentially gain nothing. The fundamental library routines convert the float into a double, perform the operation in double precision, convert the double back into a float, and then return. These conversions are inherent in the library and cannot be avoided, and they defeat the purpose of using the float format. The following discussion explores other representations of real numbers that are more efficient than the default floating-point format.

Fixed-Point Arithmetic and Solution of Transcendental Functions

Is there an alternative to floating-point arithmetic? We need floating point to represent the huge range of numbers we deal with; an exponent range of -127 through +127 combined with 23 bits of precision can only be accommodated in a floating-point format. The other alternative is the fixed-point format. A 32-bit number can be thought of as carrying 8 bits of integer and 24 bits of fraction, and a 64-bit number can be divided into 32 bits of integer and 32 bits of fraction. However, we then limit the numbers to a very narrow range.

The maximum integer value in an 8-bit integer is 255, and the maximum precision of the fraction in a 24-bit mantissa is 1 in roughly 16 million, or about 0.00000006. All rational numbers are composed of an integer portion and a fraction portion. The decimal point is placed between them to distinguish the end of the integer from the beginning of the fraction.

The positional numbering system base $$b$$ defines a real number as follows:

$$ a_nb^n + \cdots + a_2b^2 + a_1b^1 + a_0b^0 + a_{-1}b^{-1} + a_{-2}b^{-2} + a_{-3}b^{-3} + \cdots + a_{-n}b^{-n} $$

For example, the decimal number 34.61 equals:

$$ 34.61 = 0\times 10^n + \cdots + 3 \times 10^1 + 4 \times 10^0 + 6 \times 10^{-1} + 1 \times 10^{-2} + \cdots + 0 \times 10^{-n} $$

Computers have a fixed word-length limit and are optimized for fixed-word calculations. Each word can be thought of as an integer, or it can be treated as a fixed-point value with a predefined integer portion and a predefined fraction portion. The decimal point is only a perception. The computational method is the same for both the integer and the fixed-point values; the only difference is the way in which the result is assembled from the different places where the CPU leaves it after an arithmetic operation.
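For instance (an illustrative value, not one used by the library), the real number 3.25 stored in the 8-bit integer, 24-bit fraction format is simply the integer

$$ 3.25 \times 2^{24} = 54{,}525{,}952 = \mathtt{0x03400000} $$

where the top 8 bits hold the integer 3 and the lower 24 bits hold the fraction 0.25.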


For example, a 32-bit number multiplied by another 32-bit number produces a 64-bit result. The 80386 CPU uses the EDX:EAX register pair to store the result: the EDX register contains the most significant 32 bits, while the EAX register contains the least significant 32 bits. In the context of a pure 32-bit integer operation, we ignore the upper 32 bits and keep only the lower 32 bits, which are in the EAX register. However, if we think of the 32-bit values as an 8-bit integer and a 24-bit fraction, the result must be obtained by combining the upper 8 bits of EAX and the lower 24 bits of EDX to bring the result back into the 32-bit word format (Figure 1.3).

Fig. 1.3 Fixed Point Multiplication


The division operation for two 32-bit fixed-point values (dividend and divisor) is achieved by combining the lower 8 bits of EDX and the upper 24 bits of EAX, as illustrated in Figure 1.4. Notice that 32-bit fixed-point addition and subtraction are identical to integer addition and subtraction. The following example demonstrates the multiplication and division operations on two 32-bit fixed-point numbers with 8 bits of integer and 24 bits of fraction.
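The following is a minimal C sketch of those two operations, written with a 64-bit intermediate (long long) in place of the EDX:EAX register pair; the function names fx_mul_824 and fx_div_824 and the use of a 64-bit type are assumptions for illustration only. The library later in this chapter (fx_mpy in Code Listing 1.1) reaches the same multiplication result using nothing but 16-bit partial products.

#include <stdio.h>

/* 8.24 fixed point: stored value = real value * 2^24 */
#define FX_ONE (1L << 24)

/* Multiply: the 64-bit product carries 48 fraction bits,
   so shift right by 24 to get back to 8.24 format.        */
static long fx_mul_824(long a, long b)
{
    return (long)(((long long)a * b) >> 24);
}

/* Divide: pre-shift the dividend left by 24 (the EDX:EAX
   setup described in the text) so the quotient comes out
   with 24 fraction bits again.                             */
static long fx_div_824(long a, long b)
{
    return (long)(((long long)a << 24) / b);
}

int main(void)
{
    long a = (long)(2.5  * FX_ONE);    /* 2.5  in 8.24 format */
    long b = (long)(1.25 * FX_ONE);    /* 1.25 in 8.24 format */

    printf("2.5 * 1.25 = %f\n", fx_mul_824(a, b) / (double)FX_ONE);
    printf("2.5 / 1.25 = %f\n", fx_div_824(a, b) / (double)FX_ONE);
    return 0;
}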

All is well and good as long as we can guarantee that the result will not exceed 255, since that is the largest integer we can put in a 32-bit fixed-point format of 8-bit integer and 24-bit fraction. Fixed-point arithmetic is definitely far superior to floating point in speed, and it still allows us to perform arithmetic on real numbers with fractions, albeit over a very limited range. Successive multiplications will soon overflow the result area; moreover, successive additions might produce a carry that has no place in the result storage. On the other hand, there is never a worry about the fraction portion being lost in the intermediate operations; at most, we might lose some precision, which is acceptable since it happens in the floating-point format as well. If it were not for the fear of producing erroneous results in the intermediate stages, we would definitely gain considerable performance by using the integer instruction set of the CPU architecture for our basic calculations on real numbers.

Let us take another look at the arithmetic operations on the polynomial functions discussed previously, and at the restrictions we have on the range of the coefficients and the input values involved. Let us analyze all the arithmetic operations required to solve a given polynomial and see whether we can apply the fixed-point arithmetic instructions to the parameters.

There is an input value x (always less than 1.0) that must be raised to its power several times, that is, successive multiplications (with no chance of result overflow). Each power must then be multiplied by its coefficient, another multiply (all coefficients are fractions, so no overflow of the result here either). Finally, there is an addition or subtraction of all the terms (again with no chance of exceeding the maximum value, as the terms are all less than 1.0), and you have the result of the polynomial operation.

The preceding argument is presented to express the fact that we can safely compute the polynomial solutions using the fixed-point format alone, and that is the format used to develop the library of functions in the next section. The ideal implementation of the fixed-point format would use the native CPU instructions for 32-bit integer add, subtract, multiply, and divide. However, that would require an interface with Assembly language routines, which would cause portability problems across different platforms. Instead, the basic routines are simulated in C functions using the integer operations of the C compiler. Interested readers could rewrite the functions in Assembly language to gain a further performance improvement.

Fig. 1.4 Fixed Point Division


A Fast and Efficient Algorithm for the Square Root Function

Another example of a fixed-point implementation is the solution of the square root function. The algorithm is based on the fact that a floating-point number can easily be split into its exponent and fraction portions, and each can be treated as a separate fixed-point quantity. Think of a floating-point number in terms of its scientific notation, with the exponent portion multiplied by a fraction portion. Using the multiplicative property of the square root function, finding the square root of a number is simply the square root of the exponent portion multiplied by the square root of the fraction portion, as shown here:

$$ {\sqrt {2^x \times fraction}=\sqrt {2^x} \times \sqrt {fraction}} \\ $$

For example, the square root of the number 3.0 is equivalent to:

$$ {\sqrt {3}=\sqrt {2^2} \times \sqrt {0.75}} \\ $$

$$ {\sqrt {3}={2^1} \times \sqrt {0.75}} \\ $$

The sqrt of the exponent portion is obtained by shifting the exponent to the right 1 bit; in other words, dividing it by 2.

The fraction quantity is always less than 1.0 by definition, so we can apply the same fixed-point technique we developed in the previous section on transcendental functions.

The sqrt of a 32-bit fraction is obtained by a binary search through at most 16 iterations of a 16-bit by 16-bit multiplication of an approximation. At the end, the individual results are simply combined into floating-point format and returned.

There are two steps in this algorithm:

1. sqrt of exponent = exponent shift right 1

2. sqrt of fraction = $$X$$, if ($$X \times X = fraction$$)

See lines 430 through 555 in Code Listing 1.1 for an implementation of the algorithm.
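The fraction step of the algorithm can also be shown in isolation. The following sketch uses the same binary-search idea as the __sqrt routine in the listing, but on a self-contained 32-bit fraction (value/2^32); the function name frac_sqrt and the test value are illustrative only.

#include <stdio.h>

/* Square root of a 32-bit fraction f (representing f / 2^32, so
   0 <= value < 1) by binary search: find the 16-bit x whose square
   best matches f, so that x / 2^16 approximates the square root.
   At most 16 iterations of a 16-bit by 16-bit multiply are needed. */
static unsigned short frac_sqrt(unsigned long f)
{
    unsigned short apprx = 0x8000;    /* first guess: 0.5        */
    unsigned short step  = 0x8000;    /* bit currently decided   */
    unsigned long  sq;

    do
    {
        sq = (unsigned long)apprx * apprx;   /* 16 x 16 multiply  */
        step >>= 1;
        if (sq == f)
            break;                    /* exact match found        */
        if (sq > f)
            apprx -= step;            /* guess too high, back off */
        else
            apprx += step;            /* guess too low, push up   */
    } while (step != 0);

    return apprx;
}

int main(void)
{
    /* 0.75 as a 32-bit fraction is 0.75 * 2^32 = 0xC0000000 */
    unsigned short r = frac_sqrt(0xC0000000UL);
    printf("sqrt(0.75) is approximately %f\n", r / 65536.0);
    return 0;
}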


/************* Transcendental Functions ***********
The sin, cos, tan, exp, ln and square root function
solutions are presented using fixed-point arithmetic
and reduced-coefficient polynomials.
You can use the file as an include or compile it into
a library for linking with other modules. The names
of the functions are prefixed with a double underscore,
such as __cos, so that they do not conflict with the
default library routines.
The following are the exported functions:

double __sin(double x);
double __cos(double x);
double __tan(double x);
double __exp(double x);
double __lge(double x);
double __lgd(double x);
double __sqrt(double x);

**************************************************/
#include <stdio.h>
#include <math.h>
#define FLOATDIGITS 6
#define MAXINTEGER 2.147483647e+9
#define debug
#define test
#define fixed
#define TWO_POINT_ZERO 2.0
#define TWO_PI_CONST 6.283185
#define PI_CONST 3.141593
#define PI_2_CONST 1.570796
#define PI_4_CONST 0.7853982
#define ONE_POINT_ZERO 1.0
#define CONST_POINT_693 0.693
#define CONST_LN_2 0.69314718
#define CONST_LN_10 2.302585093
#ifndef INFINITY
#define INFINITY 1.0E99
#endif
#ifndef NAN
#define NAN 0.5E99
#endif
#define modulous(r1,r2) ((r1) - (((int)((r1)/(r2))) * (r2)))
/********************************************
The coefficients of the polynomials are converted
into fixed-point format, as the polynomial solution
requires the parameters to be presented in fixed-
point format
*********************************************/

#ifdef fixed
long SIN_COEF_TBL[5] = {0x1000000,0,0xffd57dc0,0,0x1f2ba};
long COS_COEF_TBL[5] = {0x1000000,0,0xff80d845,0,0x97c1b};
long COT_COEF_TBL[5] = {0x1000000,0,0xffaac93b,0,0xfff9c2f5};
long ATAN_COEF_TBL[10] = {0,0xfff738,0,0xffab717e,0,0x2e1db8,0,0xffea34ba,
0,0x55572};
long LN_COEF_TBL[6] = {0,0xffdef1,0xff821241,0x4a1b05,0xffdd2afe,0x83b89};
long EXP_COEF_TBL[5] = {0x1000000,0xff0008a0,0x7f901a,0xffd728d5,0x78467};
#else
float SIN_COEF_TBL[5] = {1.0,0.0,-0.16605,0.0,0.00761};
float COS_COEF_TBL[5] = {1.0,0.0,-0.49670,0.0,0.03705};
float COT_COEF_TBL[5] = {1.0,0.0,-0.332867,0.0,-0.024369};
float ATAN_COEF_TBL[10] = {0.0,0.9998660,0.0,-0.3302995,0.0,
0.1801410,0.0,-0.0851330,0.0,0.0208351};
float LN_COEF_TBL[6] = {0.0,0.99949556,-0.49190896,0.28947478,
-0.13606275,0.03215845};
float EXP_COEF_TBL[5] = {1.0,-0.9998684,0.4982926,-0.1595332,0.0293641};
#endif
/**********************************************************
Exponent is calculated as the exponent of the integer
multiplied by the exponent of the fraction. Exponents of
integers up to 32 are sufficient for all practical purposes;
beyond that, it might as well be infinity.

A lookup table of exponent of integers is presented below
**********************************************************/
float INT_EXP_TBL[33] = {
1.000000000,
2.718281828,
7.389056099,
20.08553692,
54.59815003,
148.4131591,
403.4287935,
1096.633158,
2980.957987,
8103.083927,
22026.46580,
59874.14172,
162754.7914,
442413.3920,
1202604.284,
3269017.373,
8886110.521,
24154952.75,
65659969.14,
178482301.0,
485165195.4,
1318815735.0,
3584912846.0,
9744803446.0,
2.6489122E10,
7.2004899E10,
1.9572960E11,
5.3204824E11,
1.4462570E12,
3.9313342E12,
1.0686474E13,
2.9048849E13,
1.0E38};
/****************** function define ****************/
double __sin();
double __cos();
double __tan();
double __exp();
double __lge();
double __lgd();
double __atan();
double fx2fl();
long fl2fx();
long fx_mpy();
/******************************************************
general purpose conversion routines
*******************************************************/

/***************floating point to fixed point *********/
long fl2fx(double x)
{
long *k,mantissa,exp;
float temp;
temp = (float) x;

/* get a pointer to fl pt number */
k = (long *) & temp;

/* remove sign bit from the number */
mantissa =(*k & 0x7fffffff);

/* extract exponent from the mantissa */
exp = ((mantissa >> 23) & 0xff);
if (!exp)
return (0);
/* mantissa portion with normalized bit */
mantissa = ((mantissa & 0xffffff) | 0x800000);

/* increase exp for bringing in the normalized bit */
exp++;
if (exp > 0x7f)
{
/* exp > 0x7f indicates integer portion exist */
exp = exp - 0x7f;
mantissa = mantissa << exp;
}
else
{
/* exp <= 0x7f indicates no integer portion */
exp = 0x7f - exp;
mantissa = mantissa >> exp;
}
if (*k >= 0)
return(mantissa);
else
return (-mantissa);
}

/************* fixed point to floating point *********/
double fx2fl(long j)
{
long i,exp;
float temp,*fp;
i = j;
if (i == 0)
return ((double)(i));
/*
maximum exp value for an 8-bit int and 24-bit fraction
fixed point
biasing factor = 7f;
7f + 8 = 87;
*/
exp = 0x87;
if (i < 0)
{
exp = exp | 0x187;
i = -i;
}
/* normalize the mantissa */
do
{
i = i << 1; exp --;
}
while (i > 0);
i = i << 1; exp --;

/* place mantissa on bit-0 to bit-23 */
i = ((i >> 9) & 0x7fffff);
exp = exp << 23;
i = i | exp;
fp = (float *) & i;
temp = *fp;
return ((double) temp);
}

/*************** fixed point multiply ****************/
long fx_mpy(long x,long y)
{
unsigned long xlo,xhi,ylo,yhi,a,b,c,d;
long sign;

/* The result sign is + if both sign are same */

sign = x ^ y;
if (x < 0)
x = -x;
if (y < 0)
y = -y;

/*
two 32-bit numbers are multiplied as
if they were four 16-bit words
*/
xlo = x & 0xffff;
xhi = x >> 16;
ylo = y & 0xffff;
yhi = y >> 16;
a = xlo * ylo;
b = xhi * ylo;
c = xlo * yhi;
d = xhi * yhi;
a = a >> 16;

/* add all partial results */
a = a+b+c;
a = a >> 8;
d = d << 8; a = a+d;
if (sign < 0)
a = -a;
return(a);
}

/*************** fixed point polynomial ****************/
/*
Three input parameters; all fixed point
input x;
list of coefficients
number of coefficients in the list
*/
double polynomial_x(double r1,long * r2,int i)
{
long j,k,temp,sum;
k = fl2fx(r1);
sum = 0;

/* steps
find power of x,
multiply with coefficient
add all terms
*/
while(i--)
{
temp = 0x1000000; /* this is 1.0 */
if (r2[i] != 0)
{
/* power of x */
for (j=i;j>0;j--)
temp = fx_mpy(temp,k);

/* and then multiply with coefficient */
temp = fx_mpy(temp,r2[i]);
}
else
temp = 0;
sum = sum + temp;
};
return(fx2fl(sum));
}
/********************* sin ************************/
double __sin(double x)
{
float temp;
int quadrent;
unsigned long *k;
temp = (float) x;
/* make absolute value */
k = (unsigned long *) &temp;
*k = (*k & 0x7fffffff);

/* 360 degree modulus */
if (temp > TWO_PI_CONST)
{
temp = modulous (temp,TWO_PI_CONST);
}

/* negative angles are made complements of 360 degrees */
if ( x < 0)
temp = TWO_PI_CONST - temp;
quadrent = (int) (temp / PI_2_CONST);
temp = modulous(temp,PI_CONST);

if (temp > PI_2_CONST)
temp = (PI_CONST - temp);

temp = (polynomial_x(temp,SIN_COEF_TBL,5) * temp);

if (quadrent >= 2)
temp = -temp;
return (temp);
}

/********************* cos ************************/
double __cos(double x)
{
float temp;
int quadrent;
unsigned long *k;
temp = (float) x;

/* make absolute value */
k = (unsigned long *) &temp; *k = (*k & 0x7fffffff);

/* 360 degree modulus */
if (temp > TWO_PI_CONST)
{
temp = modulous (temp,TWO_PI_CONST);
}

/* negative angles are made complements of 360 degrees */
if ( x < 0)
temp = TWO_PI_CONST - temp;

/* find out the quadrant from the original angle */
quadrent = (int) (temp / PI_2_CONST);

temp = modulous(temp,PI_CONST);
if (temp > PI_2_CONST)
temp = (PI_CONST - temp);

temp = (PI_2_CONST - temp );

temp = (polynomial_x(temp,SIN_COEF_TBL,5) * temp);

if ((quadrent ==1) || (quadrent == 2)) temp = -temp;
return ((float) (temp));
}

/********************* tan ************************/
double __tan(double x)
{
float temp;
int quadrent;

unsigned long *k;
temp = (float) x;

/* make absolute value */
k = (unsigned long *) &temp;
*k = (*k & 0x7fffffff);

/* 360 degree modulus */
if (temp > TWO_PI_CONST)
{
temp = modulous (temp,TWO_PI_CONST);
}

/* negative angles are made complements of 360 degrees */
if ( x < 0)
temp = TWO_PI_CONST - temp;

quadrent = (int) (temp / PI_2_CONST);
temp = modulous(temp,PI_CONST);

if (temp > PI_2_CONST)
temp = (PI_CONST - temp);

if (temp < PI_4_CONST)
{
temp=((ONE_POINT_ZERO /
polynomial_x(temp,COT_COEF_TBL,5)) * temp);
}
else
{
temp = PI_2_CONST - temp;
temp = ONE_POINT_ZERO /
((ONE_POINT_ZERO /polynomial_x(temp,COT_COEF_TBL,5)) * temp);
}
if ((quadrent == 1)||(quadrent==3))
temp = -temp;
return (temp);
}

/********************* exp ************************/
/* e pwr x = 2 pwr (x * log2_e) */
/* if |X| <= 0.693.. then exp(x) = exp(fraction) */
/* else exp(x) = exp(INT + 0.693 + FRACTION) */
/* exp(0.693) = 2;*/
double __exp(double x)
{
float temp = x;
int i;
unsigned long *k;
if (x > 0.0)
i= ((int) (x + 0.0000001));
else
i= ((int) (x - 0.0000001));

if (i < 0)
i = 0-i;
if (i > 31)
{
if (x > 0)
return(INT_EXP_TBL[32]);
else
return(0);
}
temp = (float) x;

/* make absolute value */
k = (unsigned long *) &temp;
*k = (*k & 0x7fffffff);
temp = modulous(temp,ONE_POINT_ZERO);
if (temp > CONST_POINT_693)
{
temp = temp - CONST_POINT_693;
temp = ONE_POINT_ZERO /
(polynomial_x(temp,EXP_COEF_TBL,5));
temp = temp * TWO_POINT_ZERO;
}
else
{
temp = ONE_POINT_ZERO /
(polynomial_x(temp,EXP_COEF_TBL,5));
}
temp = (INT_EXP_TBL[i]) * temp;

if (x < 0)
temp = ONE_POINT_ZERO / temp;
return (temp);
}
/********************* ln (x)************************/
/* log base E (X) = ln (x)/ln (E) */
/* ln(x) = pwr * ln(2) + ln(fraction+1) */
/* range =(0 <= x <= 1)*/

double __lge(double x)
{
static long exp,i,*k;
static float temp,xtemp,*fraction;

xtemp = (float) x;

temp=0;
k =(long *) &xtemp; i=k[0];

/* do not compute ln of a minus number */
if (i <= 0)
return (NAN); /* NAN could be replaced by 0.0 */

/*remove exp and sign and check for fraction*/
if ((i =(i & 0x7fffff)) != 0)
{
exp = 0x7f;
while ((i & 0x800000) == 0)
{
i = i << 1; exp--;
}

i = i & 0x7fffff; /* remove normalized bit */
exp = exp <<23; /* new exp for fraction */
i = i | exp; /* combine exp and mantissa */

fraction = (float *) &i;
temp = fraction[0];

temp = (polynomial_x(temp,LN_COEF_TBL,6));
}

/* from input value x extract the exp */
exp = *k;
exp = exp >> 23;
exp = exp - 0x7f;

temp = temp + (exp * CONST_LN_2);
return(temp);
}
/********************* lgd ************************/
double __lgd(double x)
{
return (__lge(x)/CONST_LN_10);
}

/********************* square root **********************
A fast and efficient algorithm for Square Root function.
The algorithm is based on the fact that a floating point number
can be easily split into its exponent and fraction portion and
each can be treated as separate integer quantity. The sqrt of
exponent portion is obtained by shifting the exponent right 1-
bit, i.e. dividing it by 2 and the sqrt of a 32-bit fraction is
done by a binary search of 65536 possible values or 16 trials of
16-bit by 16-bit multiplications of an approximation. At the end,
the individual results are simply combined into floating-point
format and returned.

sqrt of exponent = exponent shift right 1 ........ (1)
sqrt of fraction = X, if (X x X = fraction) ........ (2)
**********************************************************/
double __sqrt(double x)
{
float xtemp;
unsigned long exp,i,j,*k;
unsigned short int new_apprx,old_apprx;

xtemp = (float) x;

/*
step 1------ Think of the floating point number as an
integer quantity
*/
k =(unsigned long *) &xtemp; i=k[0];
/*
step 2------ do not compute sqrt of a minus number
*/
if (x <= 0.0)
{
if (x==0.0)
return(x);
else
return (NAN); /* NAN could be replaced by 0.0 */
}
/*
step 3--- extract exp and place it in a byte location
*/
exp = (i & 0x7f800000);
exp >>= 23;
/*
step 4 --- bring fraction portion to 32-bit position and
bring normalized bit
*/
i <<= 8;
i = i | 0x80000000;
/*
step 5 --- inc exp for bringing in normal bit and
remove exp bias
*/
exp += 1;
exp -= 0x7f;
/*
step 6 --- compute square root of exp by shift
right or divide by 2
if exp is odd then shift right mantissa
one more time
*/
if ( exp & 1)
{
i >>= 1;
exp >>= 1;
exp += 1;
}
else
exp >>= 1;

/*
step 7--- add bias to exp
*/
exp += 0x7f;
/*-----------------sqrt of fraction ---------------*/
/*
step 8 -----start with 0x8000 as the first apprx
*/
j = old_apprx = new_apprx = 0x8000;

/*
step 9 ---use binary search to find the match.
loop count maximum 16
*/
do
{
/*
step 9a ---- multiply approximate value to itself
*/
j = j*j;
/*
step 9b ---- next value to be added in the binary
search
*/
old_apprx >>= 1;
/*
step 9c ---- terminate loop if match found
*/
if (i == j )
old_apprx = 0;
else
{
/*
step 9d --- if approx. exceeded the real fraction
then lower the approximate
*/
if (j > i )
new_apprx -= old_apprx;
/*
step 9e --- else increase the approximate
*/
else
new_apprx += old_apprx;
}
j =new_apprx;
} while (old_apprx != 0);

/*
step 10 --- combine exp and mantissa
*/
j =new_apprx;
/*
step 10a --- bring mantissa to bit position 0..23
and remove Normal bit
*/
j <<= 8;
j &= 0x7fffff;
/*
step 10b --- decrement exponent for removing Normalized
bit
*/
exp -= 1;
/*
step 10c --- bring exponent to position 23..30
*/
exp <<= 23;
/*
step 10d --- combine exponent and mantissa and
return double
*/
j = j | exp;
xtemp = * ((float *) &j);
return (xtemp);
}
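The library can be exercised with a small test harness such as the following sketch, which is not part of the listing above. It assumes the listing has been saved as a separate file named fastmath.c (the file name is an assumption for illustration) and simply compares a few of the __-prefixed routines against the standard math library.

/* test_fastmath.c -- sanity check for the fixed-point library */
#include <stdio.h>
#include <math.h>
#include "fastmath.c"    /* assumed name of the listing above */

int main(void)
{
    double x;

    printf("   x      __sin(x)     sin(x)     __exp(x)     exp(x)\n");
    for (x = 0.1; x < 1.6; x += 0.5)
        printf("%5.1f  %10.6f %10.6f  %10.6f %10.6f\n",
               x, __sin(x), sin(x), __exp(x), exp(x));

    printf("__sqrt(3.0) = %f   sqrt(3.0) = %f\n", __sqrt(3.0), sqrt(3.0));
    printf("__lge(6.0)  = %f   log(6.0)  = %f\n", __lge(6.0), log(6.0));
    return 0;
}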








Summary

In this chapter, we developed an efficient transcendental math emulation library suitable for low-cost embedded systems. If the emphasis is on speed rather than precision, then, in the absence of a hardware math coprocessor, the alternative is fixed-point arithmetic. There are several applications, such as graphics routines, that can take advantage of the high-speed solution offered by the library presented in this chapter. A unique solution to the square root function for floating-point numbers was also presented, which is several orders of magnitude faster than a comparable library routine.