Most computers use a common internal representation for integers. In
contrast, there are several specifications for the representation of
floating-point numbers. REALLIB is a utility for transferring data
between computers that do not share the same specification for floating-point
numbers.
Generally, real-number-type variables are represented as floating-point
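numbers. Thus, in base-β representation, a real number is expressed as:
$\pm(0.f_1 f_2 f_3 \cdots f_m)_\beta \times \beta^{\pm E}$ | (1) |
$\pm(0.f_1 f_2 f_3 \cdots f_m)_\beta = \pm(f_1 \times \beta^{-1} + f_2 \times \beta^{-2} + f_3 \times \beta^{-3} + \cdots + f_m \times \beta^{-m})$ | (2) |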
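Equations (1) and (2) can be checked numerically. The following is a minimal sketch, not part of REALLIB: the function names `fraction_value` and `float_value` are hypothetical, and exact rational arithmetic is used so that no rounding from the host machine enters the result.

```python
from fractions import Fraction

def fraction_value(digits, base):
    """Evaluate 0.f1 f2 ... fm in the given base, as in equation (2)."""
    total = Fraction(0)
    for i, f in enumerate(digits, start=1):
        if not 0 <= f < base:
            raise ValueError("digit out of range for this base")
        total += Fraction(f, base ** i)   # f_i * base**(-i)
    return total

def float_value(sign, digits, base, exponent):
    """Evaluate sign * (0.f1 f2 ... fm) * base**exponent, as in equation (1)."""
    return sign * fraction_value(digits, base) * Fraction(base) ** exponent

# Binary 0.101 is 1/2 + 1/8 = 5/8.
print(fraction_value([1, 0, 1], 2))
# Hexadecimal -0.8 times 16**1 is -8.
print(float_value(-1, [8], 16, 1))
```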
The floating-point representation used in most mainframe systems is the IBM standard, whereas most compilers, such as those for UNIX, adopt the IEEE (read as "I triple E") standard. Although both standards use 32-bit words, they do not share a common numbering system, so they differ in the range and precision of the real numbers they can represent.
The IBM standard is a hexadecimal floating-point representation implemented in the so-called IBM-compatible general-purpose machines. These computers were designed for data processing in office work, and so they basically perform hexadecimal calculations. The bits are allocated as follows.
Bits | 0 | 1-7 | 8-11 | 12-15 | 16-19 | 20-23 | 24-27 | 28-31 |
Content | ± | $E+64$ | $f_1$ | $f_2$ | $f_3$ | $f_4$ | $f_5$ | $f_6$ |
Here, to keep the stored exponent part from taking a negative value, it is represented by the "elevated" (biased) method, with 64 added to $E$ ($-64 \le E \le 63$). The fixed-point part is a 6-digit hexadecimal number, with the 7th and later digits truncated. Since this method cannot conveniently represent 0, 0 is defined separately as the value whose bits are all 0.
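The bit allocation above can be turned into a small decoder. This is only an illustrative sketch under stated assumptions: the name `decode_ibm32` is hypothetical, the word is passed as an unsigned 32-bit integer, and bit 0 in the layout is taken to be the most significant bit of that integer.

```python
from fractions import Fraction

def decode_ibm32(word):
    """Decode a 32-bit IBM hexadecimal floating-point word into an exact Fraction."""
    sign = -1 if (word >> 31) & 1 else 1      # bit 0 of the layout: sign
    exponent = ((word >> 24) & 0x7F) - 64     # bits 1-7: E + 64
    fraction = word & 0xFFFFFF                # bits 8-31: hex digits f1..f6
    # An all-zero word has fraction 0 and therefore decodes to 0.
    return sign * Fraction(fraction, 16 ** 6) * Fraction(16) ** exponent

print(decode_ibm32(0x41100000))         # 0.1 (hex) * 16**1
print(float(decode_ibm32(0x7FFFFFFF)))  # largest magnitude, about 7.237e75
print(float(decode_ibm32(0x00100000)))  # smallest normalized magnitude, about 5.4e-79
```

Using `Fraction` keeps the decoded value exact; converting to a Python `float` (an IEEE double) is safe here because a double has more range and precision than either 32-bit format.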
The maximum non-zero absolute value that can be represented by this method is
$(1 - 16^{-6}) \times 16^{63} = 7.23700 \times 10^{75}$ (3)
and the minimum value is
$16^{-1} \times 16^{-64} = 5.397605 \times 10^{-79}$ (4)
Furthermore, the relative error of a value expressed by this method is smallest when $f_1 = f_2 = \cdots = f_6 = 15$:
Approximately $16^{-6} \approx 6 \times 10^{-8}$ (5)
The relative error is largest when $f_1 = 1$:
Approximately $16^{-5} \approx 10^{-6}$ (6)
When $f_1$ is small, as in this case, the high-order bits are wasted, which results in extremely poor precision.
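The two error figures follow from dividing the weight of the last kept digit by the smallest and largest possible fixed-point parts. The sketch below works this out for 6 hexadecimal digits; the function name `truncation_error_bounds` is hypothetical, and it assumes that digits beyond the last one are simply dropped, so the absolute error can approach one unit in the last digit.

```python
from fractions import Fraction

def truncation_error_bounds(base, digits):
    """Return (best, worst) relative error bounds for a normalized mantissa."""
    ulp = Fraction(1, base ** digits)        # weight of the last kept digit
    largest_mantissa = 1 - ulp               # all digits equal to base - 1
    smallest_mantissa = Fraction(1, base)    # f1 = 1, remaining digits 0
    return ulp / largest_mantissa, ulp / smallest_mantissa

best, worst = truncation_error_bounds(16, 6)
print(float(best))    # about 6e-08, compare (5)
print(float(worst))   # about 1e-06, compare (6)
```

The ratio between the two bounds is close to the base itself, which is why a large base such as 16 gives a wide spread in precision.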
The IEEE standard is a binary floating-point representation implemented on systems such as UNIX machines. In recent years, more computers have been adopting this method. Compared to the IBM standard, the IEEE standard features a higher relative precision. The bits are allocated as follows.
Bits | 0 | 1-8 | 9-11 | 12-15 | 16-19 | 20-23 | 24-27 | 28-31 |
Content | ± | $E+126$ | $f_2 \ldots f_4$ | $f_5 \ldots f_8$ | $f_9 \ldots f_{12}$ | $f_{13} \ldots f_{16}$ | $f_{17} \ldots f_{20}$ | $f_{21} \ldots f_{24}$ |
Here, the exponent part $E$ is stored by the "elevated" (biased) method, in which 126 is added to $E$ ($-125 \le E \le 128$). The fixed-point part is a 24-digit binary number; since $f_1$ is always 1, it is omitted. However, when $E = -126$ (that is, when the exponent part after elevation is 0), smaller values are represented on the assumption that $f_1 = 0$. Of course, the precision in this case is lower than that of the full 24-digit binary representation. Furthermore, $E = 129$ (exponent part after elevation equal to 255) is reserved for special values such as infinity and cannot be used in floating-point calculations. For 0, all bits are 0.
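The rules above can be sketched as a decoder that follows this document's convention (fixed-point part $0.f_1 f_2 \cdots f_{24}$, elevation 126). The names `decode_ieee32` and `native_value` are hypothetical. One assumption is made explicit in the code: when the stored exponent is 0, the value is scaled by $2^{-125}$, the same factor as the smallest normal numbers, which is what equation (9) implies. The result is compared with Python's own interpretation of the same bits through the standard `struct` module.

```python
import struct
from fractions import Fraction

def decode_ieee32(word):
    """Decode a 32-bit IEEE single-precision word into an exact Fraction."""
    sign = -1 if (word >> 31) & 1 else 1
    biased = (word >> 23) & 0xFF          # bits 1-8: E + 126
    stored = word & 0x7FFFFF              # bits 9-31: f2..f24
    if biased == 255:
        raise ValueError("special value (infinity or NaN)")
    if biased == 0:
        # Non-normal number: f1 = 0, scale assumed fixed at 2**-125.
        mantissa = Fraction(stored, 2 ** 24)
        exponent = -125
    else:
        # Normal number: the omitted digit f1 = 1 is restored.
        mantissa = Fraction((1 << 23) | stored, 2 ** 24)
        exponent = biased - 126
    return sign * mantissa * Fraction(2) ** exponent

def native_value(word):
    """Let the host interpret the same 32 bits as a big-endian IEEE single."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

for w in (0x3F800000, 0x7F7FFFFF, 0x00800000, 0x00000001):
    print(hex(w), float(decode_ieee32(w)), native_value(w))
```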
Of the normal numbers ($f_1 \ne 0$) that can be represented by this method, the one having the maximum absolute value is:
$(1 - 2^{-24}) \times 2^{128} = 3.40282347 \times 10^{38}$ (7)
and the one having the minimum absolute value is:
$2^{-1} \times 2^{-125} = 1.17549435 \times 10^{-38}$ (8)
When non-normal numbers ($f_1 = 0$) are included, the one having the minimum absolute value is:
$2^{-24} \times 2^{-125} = 1.40129846 \times 10^{-45}$ (9)
Furthermore, the relative error of normal numbers that can be represented by this method is smallest when $f_1 = f_2 = \cdots = f_{24} = 1$, and is:
Approximately $2^{-25} \approx 3 \times 10^{-8}$ (10)
It is largest when $f_1 = 1$, $f_2 = \cdots = f_{24} = 0$, and is:
Approximately $2^{-24} \approx 6 \times 10^{-8}$ (11)
These figures assume that the 25th binary digit is rounded off (rounded to nearest) when a floating-point number is rounded. Note, however, that the IEEE standard also allows other rounding methods. If, for example, numbers are rounded down (truncated) instead of rounded off, the error will be twice as large. Of course, this does not apply to non-normal numbers.
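The factor of two between the rounding methods can be seen by rounding a fixed-point part to 24 binary digits both ways. This is a simplified sketch, not the IEEE algorithm itself: the names `round_mantissa` and `relative_error` are hypothetical, and ties in "nearest" mode are rounded upward here, whereas the IEEE default rounds ties to the even neighbour.

```python
import math
from fractions import Fraction

DIGITS = 24  # binary digits kept in the fixed-point part

def round_mantissa(x, mode):
    """Round a Fraction x in [1/2, 1) to DIGITS binary digits."""
    scaled = x * 2 ** DIGITS
    if mode == "nearest":
        n = math.floor(scaled + Fraction(1, 2))   # round off the 25th digit
    elif mode == "down":
        n = math.floor(scaled)                    # drop the 25th and later digits
    else:
        raise ValueError("mode must be 'nearest' or 'down'")
    return Fraction(n, 2 ** DIGITS)

def relative_error(x, mode):
    return abs(round_mantissa(x, mode) - x) / x

# Inputs chosen near the worst case for each mode (f1 = 1, other digits 0).
near_worst = Fraction(1, 2) + Fraction(1, 2 ** 25)
down_worst = Fraction(1, 2) + Fraction(1, 2 ** 24) - Fraction(1, 2 ** 40)

print(float(relative_error(near_worst, "nearest")))  # just under 2**-24
print(float(relative_error(down_worst, "down")))     # just under 2**-23
```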
Below is a list of representation methods used by computers around us.
Computer | OS | Compiler | Floating-point representation method | Notes |
FACOM | MSP(General purpose) | FORT77EX | IBM | |
FACOM | XMP(UNIX) | FORT77EX | IBM | |
HITAC | VOS3(General purpose) | ?? | IBM | |
HITAC | HIUXM(UNIX) | f77 | IBM | |
SUN | UNIX | SUN FORTRAN | IEEE | |
PC9801 | MS-DOS | F77L(Lahey) | IEEE | |
PC9801 | MS-DOS | BASIC | Others* | |
Note (*): this is a binary representation that resembles IEEE, but it has a different elevation value for the exponent part and a different position for the sign bit.