1.3 Internal Representation of Floating-Type Values

Most computers use a common internal representation for integers. Floating-point numbers, in contrast, are represented according to several different specifications. REALLIB is a utility for transferring data between computers that do not share the same floating-point specification.

Generally, real-type variables are represented as floating-point numbers. In base-β representation, a real number is expressed as:

$$\pm (0.f_1 f_2 f_3 \cdots f_m)_\beta \times \beta^{\pm E} \tag{1}$$

Here,

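$$\pm (0.f_1 f_2 f_3 \cdots f_m)_\beta = \pm \left( f_1 \times \beta^{-1} + f_2 \times \beta^{-2} + f_3 \times \beta^{-3} + \cdots + f_m \times \beta^{-m} \right) \tag{2}$$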

corresponds to the fixed-point part, and the f_i are integers from 0 to β-1, with f₁ ≠ 0. E is the exponent part and is either 0 or a positive integer.
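As a rough illustration of equations (1) and (2), the following Python sketch splits a number into a sign, m base-β digits, and an exponent, and then rebuilds the value from those parts. The names `to_digits` and `from_digits` are hypothetical and are not part of the Dennou library. The sketch truncates after m digits and, for brevity, returns a signed exponent instead of a separate sign and a non-negative E.

```python
from fractions import Fraction


def to_digits(x, beta, m):
    """Return (sign, [f1..fm], exponent) so that
    x is approximately sign * 0.f1f2...fm (base beta) * beta**exponent.
    Digits beyond the m-th are truncated."""
    x = Fraction(x)
    if x == 0:
        # Zero cannot satisfy f1 != 0, so it is treated as a special case.
        return 1, [0] * m, 0
    sign = -1 if x < 0 else 1
    frac = abs(x)
    exponent = 0
    # Normalize so that 1/beta <= frac < 1, which guarantees f1 != 0.
    while frac >= 1:
        frac /= beta
        exponent += 1
    while frac < Fraction(1, beta):
        frac *= beta
        exponent -= 1
    digits = []
    for _ in range(m):
        frac *= beta
        d = int(frac)  # integer part is the next digit
        digits.append(d)
        frac -= d
    return sign, digits, exponent


def from_digits(sign, digits, exponent, beta):
    """Evaluate equation (2) and apply the sign and exponent of equation (1)."""
    fixed = sum(Fraction(d, beta ** (i + 1)) for i, d in enumerate(digits))
    return sign * fixed * Fraction(beta) ** exponent


parts = to_digits(Fraction(1, 10), 16, 6)
print(parts)                           # (1, [1, 9, 9, 9, 9, 9], 0)
print(float(from_digits(*parts, 16)))  # slightly below 0.1 because of truncation
```

Exact rational arithmetic is used here only so that the digits are not disturbed by the host machine's own floating-point rounding.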

The floating-point representation used on most mainframe systems is the IBM standard, whereas most compilers on systems such as UNIX adopt the IEEE (read as "I triple E") standard. Although both standards use 32-bit words, they do not share a common number base: the IBM standard is hexadecimal and the IEEE standard is binary. The range and precision of the real numbers that can be represented therefore differ between the two standards.

IBM Standard

This is a hexadecimal floating-point representation implemented on the so-called IBM-compatible general-purpose machines. These computers were designed for office data processing, and so they basically perform hexadecimal calculations. The bits are allocated as follows.

Bits	0	1–7	8–11	12–15	16–19	20–23	24–27	28–31
Field	±	E+64	f₁	f₂	f₃	f₄	f₅	f₆

Here, to prevent the exponent part from taking a negative value, it is stored by the "elevated" method (a biased exponent), with 64 added to E (-64 ≤ E ≤ 63). The fixed-point part is a 6-digit hexadecimal number, with the 7th and later digits truncated. Since this method cannot conveniently represent 0, 0 is defined separately as the value whose bits are all 0.

The maximum non-zero absolute value that can be represented by this method is

$$(1 - 16^{-6}) \times 16^{63} = 7.23700 \times 10^{75} \tag{3}$$

and the minimum value is

$$16^{-1} \times 16^{-64} = 5.397605 \times 10^{-79} \tag{4}$$

Furthermore, the relative error of a value expressed by this method is smallest when f₁=f₂=···=f₆=15, and is

$$\text{approximately } 16^{-6} \approx 6 \times 10^{-8} \tag{5}$$

The relative error is largest when f₁=1, and is

$$\text{approximately } 16^{-5} \approx 10^{-6} \tag{6}$$

When f₁ is small, as in this case, the high-order bits are wasted (f₁=1 is 0001 in binary, so its three leading bits carry no information), which results in very poor precision.
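The bit allocation and the limits in equations (3), (4), and (6) can be checked with a short Python sketch. The function name `ibm32_to_float` is hypothetical and is not a REALLIB routine. It assumes the 32-bit word is supplied as an unsigned integer with bit 0 of the layout above as the most significant bit.

```python
def ibm32_to_float(word):
    """Decode a 32-bit IBM hexadecimal floating-point word.

    Layout assumed: 1 sign bit, 7 bits holding E+64, then six
    hexadecimal digits f1..f6 forming the fixed-point part 0.f1...f6.
    """
    sign = -1.0 if (word >> 31) & 1 else 1.0
    elevated = (word >> 24) & 0x7F   # E + 64
    fraction = word & 0x00FFFFFF     # f1..f6 as a 24-bit integer
    # An all-zero word gives fraction == 0 and therefore decodes to 0.0.
    return sign * (fraction / 16**6) * 16.0 ** (elevated - 64)


print(ibm32_to_float(0x41100000))  # 1.0, stored as 0.1 (hex) * 16**1
print(ibm32_to_float(0x7FFFFFFF))  # largest value, equation (3)
print(ibm32_to_float(0x00100000))  # smallest value with f1 != 0, equation (4)

# Gap between 1.0 and the next representable number, relative to 1.0:
# with f1 = 1 this is 16**-5, the worst case of equation (6).
print(ibm32_to_float(0x41100001) - 1.0)
```

The result is held in the host's double-precision type, which is wide enough to represent every 32-bit IBM value exactly.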

IEEE Standard

This is the binary floating-point representation implemented on systems such as UNIX machines. In recent years, more and more computers have adopted it. Compared with the IBM standard, the IEEE standard offers higher relative precision. The bits are allocated as follows.

Bits	0	1–8	9–11	12–15	16–19	20–23	24–27	28–31
Field	±	E+126	f₂–f₄	f₅–f₈	f₉–f₁₂	f₁₃–f₁₆	f₁₇–f₂₀	f₂₁–f₂₄

Here, the exponent part E is stored by the "elevated" method, in which 126 is added to E (-125 ≤ E ≤ 128). The fixed-point part is a 24-digit binary number; because f₁ always has the value 1, it is omitted and only f₂ through f₂₄ are stored. However, when the exponent part after elevation is 0 (nominally E = -126), smaller values are represented by assuming f₁=0, with the same scale factor 2^-125 as the smallest normal numbers. Of course, the precision in this case is lower than that of the full 24-digit binary representation. Furthermore, E = 129 (the exponent part after elevation is 255) is reserved for special values such as infinity and cannot be used in floating-point calculations. For 0, all bits are 0.

Of the normal numbers (f₁ ≠ 0) that can be represented by this method, the one with the maximum absolute value is:

$$(1 - 2^{-24}) \times 2^{128} = 3.40282347 \times 10^{38} \tag{7}$$

and the one with the minimum absolute value is:

$$2^{-1} \times 2^{-125} = 1.17549435 \times 10^{-38} \tag{8}$$

When non-normal numbers (f₁=0) are included, the one with the minimum absolute value is:

$$2^{-24} \times 2^{-125} = 1.40129846 \times 10^{-45} \tag{9}$$
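The following Python sketch decodes an IEEE 32-bit value using this section's convention (fixed-point part 0.f₁f₂···f₂₄ and an exponent elevated by 126) and reproduces equations (7) to (9). The helper names `ieee32_fields` and `ieee32_value` are hypothetical; only the standard `struct` module is assumed.

```python
import struct


def ieee32_fields(x):
    """Return (sign bit, elevated exponent E+126, stored digits f2..f24)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def ieee32_value(sign, elevated, stored):
    """Rebuild the value as +-(0.f1 f2 ... f24)_2 * 2**E."""
    if elevated == 255:
        raise ValueError("reserved for special values such as infinity")
    if elevated == 0:
        # Non-normal number (or zero): f1 = 0, scale fixed at 2**-125.
        fixed = stored / 2**24
        exponent = -125
    else:
        # Normal number: restore the omitted leading digit f1 = 1.
        fixed = (2**23 + stored) / 2**24
        exponent = elevated - 126
    return (-1.0) ** sign * fixed * 2.0**exponent


print(ieee32_fields(1.0))               # (0, 127, 0): 1.0 = 0.1 (binary) * 2**1
print(ieee32_value(0, 254, 0x7FFFFF))   # largest normal number, equation (7)
print(ieee32_value(0, 1, 0))            # smallest normal number, equation (8)
print(ieee32_value(0, 0, 1))            # smallest non-normal number, equation (9)
```

Note that many references describe the same bit pattern as 1.f × 2^e with a bias of 127; the two descriptions denote identical values.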

Furthermore, the relative error of the normal numbers that can be represented by this method is smallest when f₁=f₂=···=f₂₄=1, and is:

$$\text{approximately } 2^{-25} \approx 3 \times 10^{-8} \tag{10}$$

It is largest when f₁=1, f₂=···=f₂₄=0, and is:

$$\text{approximately } 2^{-24} \approx 6 \times 10^{-8} \tag{11}$$

These figures assume that floating-point numbers are rounded to nearest, that is, that the 25th binary digit is rounded off. Note, however, that the IEEE standard also allows other rounding methods. If, for example, numbers are rounded down (truncated) instead of rounded off, the error is twice as large. Of course, none of this applies to non-normal numbers.
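The factor of two between rounding off and rounding down can be seen in a small Python sketch. The function `quantize` is a hypothetical helper that keeps 24 binary digits of the fixed-point part; it relies on `math.frexp`, which returns a fixed-point part in [0.5, 1) and so matches the 0.f₁f₂··· form used in this section. It models the rounding step only and ignores exponent limits and non-normal numbers.

```python
import math


def quantize(x, digits=24, mode="nearest"):
    """Keep `digits` binary digits of the fixed-point part of x."""
    fixed, exponent = math.frexp(x)   # x = fixed * 2**exponent, 0.5 <= |fixed| < 1
    scaled = fixed * 2**digits        # exact: scaling by a power of two
    if mode == "nearest":
        kept = round(scaled)          # round off the (digits+1)-th digit
    else:
        kept = math.trunc(scaled)     # round down (toward zero)
    return math.ldexp(kept, exponent - digits)


x = 0.5 + 3 * 2.0**-26                # lies 3/4 of the way between two neighbours
for mode in ("nearest", "truncate"):
    y = quantize(x, mode=mode)
    print(mode, abs(y - x) / x)       # relative error of each rounding method
```

For any normal value the relative error is at most 2^-24 with rounding to nearest and stays below 2^-23 with truncation, which is the doubling described above.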

Thus, even when the same number of bits is used to represent a number, the range and precision of the representable floating-point numbers differ from system to system. In the Dennou library, these system-dependent constants are handled by MATH1/SYSLIB and GLpSET/GLpGET.
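To make the comparison concrete, the limits of both formats can be computed from the parameters quoted in this section (base, number of digits, and exponent range). This Python sketch is only an illustration: the names `limits` and `FORMATS` are hypothetical, and it does not show how MATH1/SYSLIB or GLpSET/GLpGET obtain their constants.

```python
def limits(beta, m, e_min, e_max):
    """Range and worst-case spacing of a format with m base-beta digits."""
    base = float(beta)
    return {
        # All digits at their maximum and the largest exponent.
        "largest": (1.0 - base**-m) * base**e_max,
        # f1 = 1, remaining digits 0, smallest exponent.
        "smallest_normal": base**-1 * base**e_min,
        # Relative gap between neighbours when f1 = 1 and the rest are 0.
        "worst_spacing": base ** (1 - m),
    }


# (base, digits, minimum E, maximum E) as given in this section.
FORMATS = {"IBM": (16, 6, -64, 63), "IEEE": (2, 24, -125, 128)}

for name, params in FORMATS.items():
    info = limits(*params)
    print(name, {key: f"{value:.6e}" for key, value in info.items()})
```

The worst-case relative error equals the full spacing when digits are truncated, as in equation (6), and half the spacing when rounding to nearest, as in equation (11).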

The table below lists the representation methods used by the computers commonly available to us.

Computer	OS	Compiler	Floating-point representation method	Notes
FACOM	MSP(General purpose)	FORT77EX	IBM
FACOM	XMP(UNIX)	FORT77EX	IBM
HITAC	VOS3(General purpose)	??	IBM
HITAC	HIUXM(UNIX)	f77	IBM
SUN	UNIX	SUN FORTRAN	IEEE
PC9801	MS-DOS	F77L(Lahey)	IEEE
PC9801	MS-DOS	BASIC	Others*

Note (*): this is a binary representation that resembles IEEE, but it has a different elevation value for the exponent part and a different position for the sign bit.

