Floating Point Memory Format
The general idea behind floating point numbers is the same as behind scientific notation: write a mantissa, combined with an exponent to move the decimal point. E.g. $1.093 \times 10^5 = 10^5 \sum_{k=0}^3 a_k 10^{-k}$ with $a_k = 1,0,9,3$. The exact same thing can be done in binary: $\pm\, 2^e \sum_{k=0}^{p-1} a_k 2^{-k}$, with binary digits $a_k \in \{0, 1\}$.
In memory, floats can be stored by reserving a section of the total available bits for the exponent, while the mantissa is stored in the remaining bits. For a double precision float, the total size is 64 bits. One bit is used to encode the sign of the number, 11 bits the exponent, and the remaining 52 bits the mantissa. In order to only have to store positive integers as exponents, a fixed bias of 1023 is subtracted from the stored exponent. Also, the digit $a_0$ is assumed to be 1 and is not stored explicitly in the mantissa; the effective number of mantissa digits is therefore 53. For details, see the Wikipedia article on the Double precision floating-point format. The Single precision floating-point format is very similar, but uses only 32 bits in total.
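As a worked example: $6.25 = 1.1001_2 \times 2^2$ is stored with sign bit $0$, stored exponent $2 + 1023 = 1025$ (0x401), and mantissa bits $1001$ followed by 48 zeros (0x9000000000000); the full 64-bit pattern is therefore 0x4019000000000000.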
Parsing Double Precision Floats in Python
It is relatively easy to parse the binary representation of a double precision float. I've written a Python script to do so (both for double and single precision), available on GitHub.
Given a representation (as a hex string) or a float value (as a string like "1.2" or "-2.3456e-192"), the script decomposes the binary representation into sign, exponent, and mantissa, and prints out this information along with the exact decimal that the binary representation corresponds to according to the floating point model.
A similar online tool for double precision floats is available at binaryconvert.com (which also handles single precision and other formats).
The key routine:
import struct

# Bit-manipulation helpers used by parse_hex (minimal versions; the full
# script on GitHub also defines hex2float and float2decimal, see below)
def test_bit(value, index):
    """ Return a nonzero result if the bit at `index` is set. """
    return value & (1 << index)

def clear_bit(value, index):
    """ Return `value` with the bit at `index` cleared. """
    return value & ~(1 << index)

def set_bit(value, index):
    """ Return `value` with the bit at `index` set. """
    return value | (1 << index)

BIAS = 1023                    # constant to be subtracted from the stored exponent
SIGN_BIT = 63                  # bit index at which the sign is stored
EXP_BIT = 52                   # bit index at which the exponent starts
EXP_MASK = 0x000fffffffffffff  # bit-mask to remove exponent, leaving mantissa
NAN_EXP = 0x7ff                # special exponent which encodes NaN/Infinity

def parse_hex(hexstring, float_format='%.15e', no_decimal=False):
    """ Take an 8-byte hex string (16 digits) representing a double precision
    float, parse it, and print detailed information about the represented
    float value.
    """
    bits = int('0x%s' % hexstring, 16)
    sign = '+1'
    if test_bit(bits, SIGN_BIT) > 0:
        sign = '-1'
        bits = clear_bit(bits, SIGN_BIT)
    stored_exp = bits >> EXP_BIT
    mantissa = bits & EXP_MASK  # mask the exponent bits
    print ""
    print "Bytes    = 0x%s" % hexstring
    print "Float    = " + float_format \
                          % struct.unpack('!d', hexstring.decode('hex'))[0]
    print "Sign     = %s" % sign
    if stored_exp == 0:
        print "Exponent = 0x%x (Special: Zero/Subnormal)" % stored_exp
        print "Mantissa = 0x%x" % mantissa
        if not no_decimal:
            if mantissa == 0:
                print "Exact Decimal = %s0" % sign[0]
            else:
                print "Exact Decimal = %s (subnormal)" \
                      % float2decimal(hex2float(hexstring))
    elif stored_exp == NAN_EXP:
        print "Exponent = 0x%x (Special: NaN/Infinity)" % stored_exp
        print "Mantissa = 0x%x" % mantissa
        if not no_decimal:
            if mantissa == 0:
                print "Exact Decimal = %sInfinity" % sign[0]
            else:
                print "Exact Decimal = NaN"
    else:
        print "Exponent = 0x%x = %i (bias %i)" % (stored_exp, stored_exp, BIAS)
        print "Mantissa = 0x%x" % mantissa
        if not no_decimal:
            mantissa = set_bit(mantissa, EXP_BIT)  # set the implicit bit
            print "Exact Decimal = %s 2^(%i) * [0x%x * 2^(-52)]" \
                  % (sign[0], stored_exp - BIAS, mantissa)
            print "              = %s" % float2decimal(hex2float(hexstring))
The procedure is quite straightforward:
- Extract the sign from the left-most bit, and set that bit to 0
- Extract the stored exponent by shifting out the mantissa bits; to get the actual exponent, the bias (1023) has to be subtracted
- Extract the mantissa by setting all bits of the exponent to 0 with a suitable bit-mask
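For comparison, here is a minimal self-contained Python 3 sketch of these three steps, using only the standard library (the function name `decompose` is just for illustration; it is not part of the script above):

import struct

def decompose(x):
    """Return (sign bit, stored exponent, mantissa bits) of a double."""
    bits = struct.unpack('!Q', struct.pack('!d', x))[0]  # raw 64 bits
    sign = bits >> 63                     # left-most bit
    stored_exp = (bits >> 52) & 0x7ff     # next 11 bits
    mantissa = bits & 0x000fffffffffffff  # remaining 52 bits
    return sign, stored_exp, mantissa

sign, e, m = decompose(6.25)
print(sign, e - 1023, hex(m))  # 0 2 0x9000000000000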
Any float can be represented as an exact decimal, without loss of information. Calculating this exact decimal is not completely trivial to implement: the basic idea is to repeatedly double the mantissa until it is a (possibly very large) integer. A routine `float2decimal` that does this is available in the FAQ for the Decimal module.
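As an aside, in Python 2.7 / 3.1 and later the conversion is built in: the `Decimal` constructor accepts a float directly and performs the exact conversion, so the manual doubling is no longer necessary:

from decimal import Decimal

# Decimal(float) yields the exact decimal encoded by the binary float:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625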
Connection to the Fortran Numerical Model
In Fortran, the numerical model for real numbers is described as follows (see the Fortran Handbook section 13.2.3):

$$x = s \times b^e \times \sum_{k=1}^{p} f_k \, b^{-k}$$

where $s$ is the sign ($\pm 1$), $b$ is the radix (2 for binary floats), $e$ is the exponent, $p$ is the number of mantissa digits, and the $f_k$ are the digits of the mantissa ($0 \le f_k < b$, with $f_1 \ne 0$ for nonzero $x$).
Note that in good Fortran fashion, the index $k$ counts from 1 as opposed to 0. This means that the exponent $e$ is shifted by one compared to the more traditional formula used in the "Floating Point Memory Format" section above.
The specific parameters for single and double precision for a given compiler/system can be found using the very useful kindfinder utility. On my system, I get the following:
FLOATING POINT MODEL (Real/Complex):
Name:             Single       Double       Extnd1
KIND:                  4            8           16
DIGITS:               24           53          113
RADIX:                 2            2            2
MINEXPONENT:        -125        -1021       -16381
MAXEXPONENT:         128         1024        16384
PRECISION:             6           15           33
RANGE:                37          307         4931
EPSILON:       1.192E-07    2.220E-16    1.926E-34
HUGE:          3.403E+38    1.798+308   1.190+4932
TINY:          1.175E-38    2.225-308   3.362-4932
Looking at double precision floats, `DIGITS` corresponds to the total number of mantissa digits, $p$ in the model. Note that `DIGITS` is 53, whereas only 52 bits are used to store the mantissa in the memory format: the 53rd bit for $2^0$ is implicit, as discussed earlier.
`TINY` and `HUGE` are the smallest and largest positive numbers representable within the model, respectively (subnormal numbers, which fall outside the model, are smaller still than `TINY`). The `parse_dp_float.py` script gives the following information.

For `TINY`:
Bytes    = 0x0010000000000000
Float    = 2.225073858507201e-308
Sign     = +1
Exponent = 0x1 = 1 (bias 1023)
Mantissa = 0x0
Exact Decimal = + 2^(-1022) * [0x10000000000000 * 2^(-52)]
              = 2.225073858507201383090232717332404064219215980462331830553327
416887204434813918195854283159012511020564067339731035811005152434161553460108
856012385377718821130777993532002330479610147442583636071921565046942503734208
375250806650616658158948720491179968591639648500635908770118304874799780887753
749949451580451605050915399856582470818645113537935804992115981085766051992433
352114352390148795699609591288891602992641511063466313393663477586513029371762
047325631781485664350872122828637642044846811407613911477062801689853244110024
161447421618567166150540154285084716752901903161322778896729707373123334086988
983175067838846926092773977972858659654941091369095406136467568702398678315290
680984617210924625396728515625E-308
For `HUGE`:

Bytes    = 0x7fefffffffffffff
Float    = 1.797693134862316e+308
Sign     = +1
Exponent = 0x7fe = 2046 (bias 1023)
Mantissa = 0xfffffffffffff
Exact Decimal = + 2^(1023) * [0x1fffffffffffff * 2^(-52)]
              = 17976931348623157081452742373170435679807056752584499659891747
680315726078002853876058955863276687817154045895351438246423432132688946418276
846754670353751698604991057655128207624549009038932894407586850845513394230458
323690322294816580855933212334827479782620414472316873817718091929988125040402
6184124858368
Note that `MINEXPONENT = EXPONENT(TINY)` and `MAXEXPONENT = EXPONENT(HUGE)`, where `EXPONENT` is the Fortran function that returns the exponent in the Fortran numerical model. These values are shifted by one compared to the stored exponent in the binary representation, due to the sum in the numerical model starting at one, as discussed above.
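Incidentally, Python's `math.frexp` uses the same normalization as the Fortran model (mantissa in $[0.5, 1)$, i.e. the sum starting at $k=1$), which makes these exponents easy to check; a small Python 3 sketch:

import math
import sys

# sys.float_info.min and .max correspond to TINY and HUGE for doubles
print(math.frexp(sys.float_info.min))  # (0.5, -1021)               -> MINEXPONENT
print(math.frexp(sys.float_info.max))  # (0.9999999999999999, 1024) -> MAXEXPONENT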
The `PRECISION` is the number of significant decimal digits that are guaranteed to be represented accurately; that is, when you assign a number to a double precision float, the number actually stored will always match the assigned number to at least 15 significant digits.
The `RANGE` is defined as `INT(MIN(LOG10(HUGE), -LOG10(TINY)))`; in this case, `RANGE = INT(-LOG10(2.225E-308)) = INT(308 - 0.347) = INT(307.653) = 307`.
Finally, `EPSILON` is the difference between one and the next larger representable number, given as $2^{-52}$. This is the smallest difference between any two neighboring numbers greater than or equal to one. `0.5*EPSILON` provides an upper bound for the relative error due to rounding.
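These values are also exposed in Python as `sys.float_info`; a quick Python 3 check:

import sys

print(sys.float_info.epsilon == 2.0**-52)  # True
print(1.0 + 2.0**-52 == 1.0)               # False: 1 + EPSILON is the next number
print(1.0 + 2.0**-53 == 1.0)               # True: rounded back to exactly 1.0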
Some final observations:
- Two distinct double precision numbers can look the same when printed to 15 significant digits. The best example for this is `one + EPSILON(one)`. However, they will always differ in the 16th significant digit in that case.
- When printing a float to an arbitrary precision (greater than the machine precision), there are never any random digits printed: the exact decimal encoded by the float determines the result. However, the digits beyond the machine precision resulting from a computation involving two floats may depend on the compiler and the environment, so that in general, results obtained on different systems cannot be expected to match beyond the machine precision.
- To test whether two floats `r1` and `r2` are the same within some relative error `delta_r`, the following expression can be used (see the Python sketch after this list): `approx_eq = (ABS(r1 - r2) <= 0.5d0*ABS(r1+r2) * delta_r)`. It makes no sense to use a `delta_r < 1.0d-16`; the test would then be equivalent to a direct comparison on the bit level.
- To write out the hex representation of a double precision float `r` in Fortran, use `write(*,'(Z16)') transfer(r, 1_8)`.
- Read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
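The relative-error test from the list above translates directly to Python (the function name `approx_eq` and the default tolerance are just for illustration):

def approx_eq(r1, r2, delta_r=1.0e-15):
    """Test whether r1 and r2 agree to within the relative error delta_r."""
    return abs(r1 - r2) <= 0.5 * abs(r1 + r2) * delta_r

print(approx_eq(1.0, 1.0 + 2.0**-52))  # True: difference is below 1e-15
print(approx_eq(1.0, 1.000001))        # False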