Floating Point Memory Format

The general idea behind floating point numbers is the same as behind scientific notation: write a mantissa, combined with an exponent to move the decimal point. E.g. \(1.093 \times 10^5 = 10^5 \sum_{k=0}^3 a_k 10^{-k}\) with \(a_k = 1,0,9,3\). The exact same thing can be done in binary:

$$ v = 2^e \sum_{k=0}^{n-1} a_k 2^{-k} $$

In memory, floats can be stored by reserving a section of the total available bits to store the exponent, while the mantissa is stored in the remaining bits. For a double precision float, the total size is 64 bit. One bit is used to encode the sign of the number, 11 bits for the exponent, and the remaining 52 bits for the mantissa. In order to only have to store positive integers as exponents, as fixed bias exponent of 1023 is subtracted from the stored exponent. Also, the digit \(a_0\) is assumed to be 1, and is not stored explicitly in the mantissa; the effective number of mantissa digits is therefore 53. For details, see the Wikipedia article for the Double precision floating-point format. The Single precision floating-point format is very similar, but only uses 32 total bits.

Parsing Double Precision Floats in Python

It is relatively easy to parse a binary representation of a double precision float. I’ve written a python script to do so (both for double and single precision), available on Github.

Given a representation (as a hex string) or a float value (as a string like “1.2” or “-2.3456e-192”) the scripts decompose the binary representation into sign, exponent, and mantissa, and print out this information along with the exact decimal the binary representation corresponds to according to the floating point model.

A similar online tool for double precision floats is available at binaryconvert.com (also single precision and others).

The key routine:

BIAS = 1023                # constant to be subtracted from the stored exponent
SIGN_BIT = 63              # bit index at which the sign is stored
EXP_BIT = 52               # bit index at which the exponent starts
EXP_MASK = 0x000fffffffffffff  # bit-mask to remove exponent, leaving mantissa
NAN_EXP = 0x7ff            # special exponent which encodes NaN/Infinity

def parse_hex(hexstring, float_format='%.15e', no_decimal=False):
    """ Take a 8-byte hex string (16 digits) representing a double precision,
        parse it, and print detailed information about the represented float
    bits = int('0x%s' % hexstring, 16)
    sign = '+1'
    if test_bit(bits, SIGN_BIT) > 0:
        sign = '-1'
    bits = clear_bit(bits, SIGN_BIT)
    stored_exp = bits >> EXP_BIT
    mantissa = bits & EXP_MASK # mask the exponent bits

    print ""
    print "Bytes         = 0x%s" % hexstring
    print "Float         = "+ float_format \
                           % struct.unpack('!d', hexstring.decode('hex'))[0]
    print "Sign          = %s" % sign
    if stored_exp == 0:
        print "Exponent      = 0x%x (Special: Zero/Subnormal)" % stored_exp
        print "Mantissa      = 0x%x" % mantissa
        if not no_decimal:
            if mantissa == 0:
                print "Exact Decimal = %s0" % sign[0]
                print "Exact Decimal = %s (subnormal)" \
                    % float2decimal(hex2float(hexstring))
    elif stored_exp == NAN_EXP:
        print "Exponent      = 0x%x (Special: NaN/Infinity)" % stored_exp
        print "Mantissa      = 0x%x" % mantissa
        if not no_decimal:
            if mantissa == 0:
                print "Exact Decimal = %sInfinity" % sign[0]
                print "Exact Decimal = NaN"
        print "Exponent      = 0x%x = %i (bias %i)" % (stored_exp,
                                                       stored_exp, BIAS)
        print "Mantissa      = 0x%x" % mantissa
        if not no_decimal:
            mantissa = set_bit(mantissa, EXP_BIT) # set the implicit bit
            print "Exact Decimal = %s 2^(%i) * [0x%x * 2^(-52)]" \
                                % (sign[0], stored_exp-BIAS, mantissa)
            print "              = %s" % float2decimal(hex2float(hexstring))

The procedure is quite straightforward:

  • Extract the sign from the left-most bit, and set it to 0
  • Extract the stored exponent by shifting out the mantissa; to get the actual exponent the bias exponent (1023) has to be subtracted
  • Extract the mantissa by setting all bits of the exponential to 0 with a suitable bitmask

Any float can be represented as an exact decimal, without loss of information. The calculating of the exact decimal is not completely trivial to implement. The basic idea is to double the mantissa until it is a (possibly very large) integer. A routine float2decimal that does this is available in the FAQ for the Decimal module.

Connection to the Fortran Numerical Model

In Fortran, the numerical model is described as follows (see the Fortran Handbook section 13.2.3):

$$ x = s b^e \sum_{k=1}^p f_k b^{-k} $$

Note that in good Fortran fashion, the index \(k\) counts from 1 as opposed to 0. This means that the exponential \(e\) is shifted by one compared to the more traditional formula used in the “Floating Point Memory Format” section above.

The specific parameters for single and double precision for a given compiler/system can be found using the very useful kindfinder utility. On my system, I get the following:

 Name:                     Single              Double              Extnd1
 KIND:                          4                   8                  16
 DIGITS:                       24                  53                 113
 RADIX:                         2                   2                   2
 MINEXPONENT:                -125               -1021              -16381
 MAXEXPONENT:                 128                1024               16384
 PRECISION:                     6                  15                  33
 RANGE:                        37                 307                4931
 EPSILON:               1.192E-07           2.220E-16           1.926E-34
 HUGE:                  3.403E+38           1.798+308          1.190+4932
 TINY:                  1.175E-38           2.225-308          3.362-4932

Looking at double precision floats, the DIGITS corresponds to the total number of mantissa digits, \(p\) in the model. Note that DIGITS is 53, whereas only 52 bits are used to store the mantissa in memory format: the 53rd bit for \(2^0\) is implicit, as discussed earlier.

TINY and HUGE are the smallest and largest representable numbers, respectively. The parse_dp_float.py script gives the following information:


Bytes         = 0x0010000000000000
Float         = 2.225073858507201e-308
Sign          = +1
Exponent      = 0x1 = 1 (bias 1023)
Mantissa      = 0x0
Exact Decimal = + 2^(-1022) * [0x10000000000000 * 2^(-52)]
              = 2.225073858507201383090232717332404064219215980462331830553327


Bytes         = 0x7fefffffffffffff
Float         = 1.797693134862316e+308
Sign          = +1
Exponent      = 0x7fe = 2046 (bias 1023)
Mantissa      = 0xfffffffffffff
Exact Decimal = + 2^(1023) * [0x1fffffffffffff * 2^(-52)]
              = 17976931348623157081452742373170435679807056752584499659891747

Note the MINEXPONENT = EXPONENT(TINY) and MAXEXPONENT = EXPONENT(HUGE), where EXPONENT is the Fortran function that returns the exponent in the Fortran numerical model. These values are shifted by one compared to the stored exponent in the binary representation, due to the sum in the numerical model starting at one, as discussed above.

The PRECISION is the number of significant digits (the digits after the decimal point) that are guaranteed to be represented accurately; That is, when you assign a number to a double precision float, the number actually stored will always match the assigned number within at least 15 significant digits.

The RANGE is defined as INT(MIN( LOG10(HUGE), -LOG10(TINY) )), in this case RANGE = INT(-LOG10(2.225e-308)) = INT(- (308 + 0.347)) = INT(307.653) = 307.

Finally, EPSILON is the difference between one and the next closest representable number, given as \(2^{-52}\). This is the smallest difference for any two neighboring numbers greater than one. 0.5*EPSILON provides an upper bound for the relative error due to rounding.

Some final observations:

  • Two distinct double precision numbers can look the same when printed to 15 significant digits. The best example for this is one + EPSILON(one). However, they will always differ in the 16th significant digit in that case.

  • When printing a float to an arbitrary precision (greater the machine precision), there are never any random digits printed. The exact decimal encoded by the float determines the result. However, the digits beyond the machine precision resulting from a computation involving two floats may depend and the compiler and the environment so that in general, results obtained on different systems cannot be expected to match beyond the machine precision.

  • To test whether two floats r1 and r2 are the same within some relative error delta_r, the following expression can be used:

    approx_eq = (ABS(r1 - r2) <= 0.5d0*ABS(r1+r2) * delta_r)

    It makes no sense to use a delta_r < 1.0d-16; the test would be equivalent to a direct comparison on the bit-level.

  • To write out the hex representation of a double precision float r in Fortran, use write(*,'(Z16)') transfer(r, 1_8).

  • Read What Every Computer Scientist Should Know about Floating Numbers


No comments

You may format you comment with Markdown. If your comment is a valid contribution, it may be posted on this page. Your email address will not be made public.