Saturday, May 2, 2009
Representation of Floating point numbers in a microprocessor
A floating point number can be represented in any of the following formats:
a) Single precision (32 bits)
b) Double precision (64 bits)
c) Double-extended precision (80 bits)
Any floating point number is of the form mantissa * 2 ^ exponent, together with a sign.
Single precision uses 32 bits:
1 bit (bit 31) is for the sign.
8 bits (bits 30:23) are for the exponent. Note that the exponent is biased.
23 bits (bits 22:0) are for the mantissa.
Double precision uses 64 bits:
1 bit (bit 63) is for the sign.
11 bits (bits 62:52) are for the exponent. Note that the exponent is biased.
52 bits (bits 51:0) are for the mantissa.
Double-Extended precision uses 80 bits:
1 bit (bit 79) is for the sign.
15 bits (bits 78:64) are for the exponent. Note that the exponent is biased.
64 bits (bits 63:0) are for the mantissa.
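As a rough illustration of the single precision layout listed above, the following Python sketch pulls the three fields out of a 32-bit value. The helper name decode_single is made up for this example, and the standard struct module is used only as a convenient way to get at the raw bits.

```python
import struct

def decode_single(value):
    """Split a float, rounded to single precision, into (sign, exponent, mantissa)."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes
    # as an unsigned 32-bit integer so the bit fields can be masked out.
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = (word >> 31) & 0x1        # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30:23 (still biased)
    mantissa = word & 0x7FFFFF       # bits 22:0
    return sign, exponent, mantissa

print(decode_single(-0.5))  # (1, 126, 0)
```

Here -0.5 is -1.0 * 2 ^ -1, so the sign bit is 1, the stored exponent is -1 + 127 = 126, and the mantissa bits are all zero.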
How is a number like 10.25 represented in single precision format?
10.25 (base 10) = 1010.01 (base 2)
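The conversion above can be reproduced with a small Python sketch. The function name to_binary and its max_fraction_bits limit are assumptions made for this example, and it only handles non-negative values: the integer part is converted directly, and the fraction is repeatedly doubled, with each carry into the integer position giving the next binary digit.

```python
def to_binary(value, max_fraction_bits=16):
    """Return a non-negative number as a binary string such as '1010.01'."""
    integer_part = int(value)
    fraction = value - integer_part
    bits = []
    # Doubling the fraction shifts the next binary digit into the integer position.
    while fraction and len(bits) < max_fraction_bits:
        fraction *= 2
        bit = int(fraction)
        bits.append(str(bit))
        fraction -= bit
    result = bin(integer_part)[2:]
    return result + "." + "".join(bits) if bits else result

print(to_binary(10.25))  # 1010.01
```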
The first step is to normalize the binary number. Normalization means rewriting the number in the form 1.xxxxxx * 2 ^ exponent.
1010.01 = 1.01001 * 2 ^ 3
The above representation is of the form mantissa * 2 ^ exponent.
Mantissa = 1.01001
Exponent = 3
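One way to check the normalization step is with Python's math.frexp. Note that frexp returns a mantissa in the range 0.5 to 1 rather than 1 to 2, so this sketch shifts it by one position to match the 1.xxxxx form used here. The helper name normalize is made up for this example, and it assumes a nonzero input.

```python
import math

def normalize(value):
    """Return (mantissa, exponent) with 1 <= |mantissa| < 2 for a nonzero value."""
    # math.frexp gives value == m * 2**e with 0.5 <= |m| < 1,
    # so double the mantissa and lower the exponent by one.
    m, e = math.frexp(value)
    return m * 2, e - 1

print(normalize(10.25))  # (1.28125, 3)
```

The mantissa 1.28125 in decimal is 1.01001 in binary (1 + 1/4 + 1/32).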
This is the information that is saved in the floating point registers inside the microprocessor. Because normalized numbers always have the form 1.xxxxx, the '1' in the integer portion of the mantissa is not saved explicitly in single and double precision but is implied. For the exponent, a biased exponent is saved instead of the 'real' exponent.
Biased exponent = Real exponent + 127 (for single precision)
For the above example, biased exponent = 3 + 127 = 130.
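The bias of 127 is not arbitrary: for an exponent field of k bits, the IEEE 754 bias is 2 ^ (k - 1) - 1. A minimal Python sketch of that arithmetic, with function names invented for this example:

```python
def bias(exponent_bits):
    """IEEE 754 exponent bias for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1

def biased_exponent(real_exponent, exponent_bits=8):
    """Value stored in the exponent field (default: single precision)."""
    return real_exponent + bias(exponent_bits)

print(bias(8))                            # 127
print(format(biased_exponent(3), "08b"))  # 10000010
```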
So we now have,
Mantissa = 01001 (omitting the leading 1 and the binary point)
Exponent = 1000 0010 (130 in binary)
Sign bit = 0 (positive number)
Constructing the 32-bit register we have:
bit 31 sign -> 0
bits 30:23 exponent -> 1000 0010
bits 22:0 mantissa -> 01001000000000000000000
So the register would read the following:
0100 0001 0010 0100 0000 0000 0000 0000
In hex:
0x41240000
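The construction above can be double-checked with a short Python sketch that assembles the three fields by hand and compares the result with what the struct module produces for 10.25. The variable names are chosen for this example only.

```python
import struct

sign = 0                        # positive number
biased_exponent = 3 + 127       # real exponent plus the single precision bias
fraction = 0b01001 << (23 - 5)  # left-align the 5 fraction bits in the 23-bit field

# Place each field at its bit position: bit 31, bits 30:23, bits 22:0.
word = (sign << 31) | (biased_exponent << 23) | fraction
print(f"0x{word:08X}")  # 0x41240000

# Cross-check against the encoding Python generates for 10.25.
(packed,) = struct.unpack(">I", struct.pack(">f", 10.25))
print(word == packed)  # True
```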
The same concept extends to the double precision and double-extended precision formats, with a bias of 1023 for double precision and 16383 for double-extended precision. Note that in the double-extended precision format, the integer part of the mantissa (implied in single/double precision) is stored explicitly in bit 63.
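As a sketch of how 10.25 looks in the two wider formats: Python's struct module can produce the double precision encoding directly, but it has no 80-bit format, so the double-extended word below is assembled by hand from the field layout described earlier. The variable names are assumptions for this example.

```python
import struct

# Double precision: let struct produce the 64-bit encoding.
(double_word,) = struct.unpack(">Q", struct.pack(">d", 10.25))
print(f"0x{double_word:016X}")  # 0x4024800000000000

# Double-extended precision: build the 80-bit word manually.
sign = 0
exponent = 3 + 16383                 # 15-bit exponent field, bias 16383
mantissa = 0b101001 << (64 - 6)      # explicit integer bit 1, then 01001, in bits 63:0
extended_word = (sign << 79) | (exponent << 64) | mantissa
print(f"0x{extended_word:020X}")  # 0x4002A400000000000000
```

In the double precision word the leading 1 is still implied, while in the double-extended word it appears as the top bit of the mantissa field (the A in A4).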
Many engineers use floating point calculators to see how a number is represented in the different formats. I recommend the IEEE floating point calculator here:
http://babbage.cs.qc.edu/IEEE-754/Decimal.html