205

What is the difference between a single precision floating point operation and a double precision floating point operation?

I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if it does, does that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use, are the double precision capabilities made use of (if they exist)?

Peter Mortensen
meds
  • 17
    The fact that a CPU is 64-bit usually means that the CPU has 64-bit **general purpose registers** (i.e. integer) and a 64-bit **memory address size**. But it says nothing about floating point math. For example, Intel IA-32 CPUs are 32-bit, but they natively support double precision floats. – Roman Zavalov Nov 26 '12 at 10:51
  • A double precision floating point operation can represent more numbers than a single precision one. Here is a good read about floating point from a programming perspective: https://levelup.gitconnected.com/why-floating-point-numbers-are-not-always-accurate-9a57e812ace1 – rjhcnf Dec 29 '20 at 08:47

10 Answers

235

Note: the Nintendo 64 does have a 64-bit processor, however:

Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance.

From Webopedia:

The term double precision is something of a misnomer because the precision is not really double.
The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number.
For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

The extra bits increase not only the precision but also the range of magnitudes that can be represented.
The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values.
Most computers use a standard format known as the IEEE floating-point format.

The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.

From the IEEE standard for floating point arithmetic:

Single Precision

The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.

  • The first bit is the sign bit, S,

  • the next eight bits are the exponent bits, 'E', and

  • the final 23 bits are the fraction 'F':

    S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
    0 1      8 9                    31
    

The value V represented by the word may be determined as follows:

  • If E=255 and F is nonzero, then V=NaN ("Not a number")
  • If E=255 and F is zero and S is 1, then V=-Infinity
  • If E=255 and F is zero and S is 0, then V=Infinity
  • If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0

In particular,

0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN

0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                     0.00000000000000000000001 = 
                                     2**(-149)  (Smallest positive value)
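
To make the decoding rules concrete, here is a minimal C sketch of my own (not part of the standard) that extracts S, E, and F from the 6.5 bit pattern above and reconstructs the value:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void) {
        /* Bit pattern from the worked example above:
           0 10000001 10100000000000000000000 = 6.5 */
        uint32_t bits = 0x40D00000u;

        uint32_t S = bits >> 31;            /* 1 sign bit             */
        uint32_t E = (bits >> 23) & 0xFFu;  /* 8-bit biased exponent  */
        uint32_t F = bits & 0x7FFFFFu;      /* 23-bit fraction        */

        /* Normalized case (0 < E < 255): V = (-1)**S * 2**(E-127) * 1.F */
        double V = (S ? -1.0 : 1.0)
                 * ldexp(1.0 + F / 8388608.0, (int)E - 127); /* 8388608 = 2**23 */
        printf("decoded:       %g\n", V);   /* prints 6.5 */

        /* Cross-check by reinterpreting the same bits as a float. */
        float f;
        memcpy(&f, &bits, sizeof f);
        printf("reinterpreted: %g\n", f);   /* prints 6.5 */
        return 0;
    }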

Double Precision

The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.

  • The first bit is the sign bit, S,

  • the next eleven bits are the exponent bits, 'E', and

  • the final 52 bits are the fraction 'F':

    S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    0 1        11 12                                                63
    

The value V represented by the word may be determined as follows:

  • If E=2047 and F is nonzero, then V=NaN ("Not a number")
  • If E=2047 and F is zero and S is 1, then V=-Infinity
  • If E=2047 and F is zero and S is 0, then V=Infinity
  • If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F) These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0
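
The same layout can be checked from the other direction. A small C sketch of my own stores 6.5 in a double and prints its fields (the masks follow from the layout above):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        double d = 6.5;   /* 1.101 binary * 2**2, as in the single precision example */
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);

        uint64_t S = bits >> 63;                  /* 1 sign bit             */
        uint64_t E = (bits >> 52) & 0x7FFu;       /* 11-bit biased exponent */
        uint64_t F = bits & 0xFFFFFFFFFFFFFull;   /* 52-bit fraction        */

        /* Expect E = 2 + 1023 = 1025 and F = bits of 0.101 followed by zeros. */
        printf("S=%llu E=%llu (unbiased %lld) F=0x%013llx\n",
               (unsigned long long)S, (unsigned long long)E,
               (long long)E - 1023, (unsigned long long)F);
        /* prints: S=0 E=1025 (unbiased 2) F=0xa000000000000 */
        return 0;
    }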

Reference:
ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic.


From the cs.uaf.edu notes on the IEEE Floating Point Standard, where the "Fraction" is generally referred to as the Mantissa:

The single precision IEEE FPS format is composed of 32 bits, divided into a 23 bit mantissa, M, an 8 bit exponent, E, and a sign bit, S:


  • The normalized mantissa, m, is stored in bits 0-22 with the hidden bit, b0, omitted.
    Thus M = m-1.

  • The exponent, e, is represented as a bias-127 integer in bits 23-30.
    Thus, E = e+127.

  • The sign bit, S, indicates the sign of the mantissa, with S=0 for positive values and S=1 for negative values.

Zero is represented by E = M = 0.
Since S may be 0 or 1, there are different representations for +0 and -0.
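
A short C sketch of my own illustrates those two zeros: they compare equal, but the sign bit (tested with C99's signbit) and the sign of 1/0 differ:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float pz = 0.0f, nz = -0.0f;

        /* +0 and -0 compare equal, yet only the sign bit differs. */
        printf("pz == nz      : %d\n", pz == nz);                    /* 1 */
        printf("signbit(+0,-0): %d %d\n", signbit(pz), signbit(nz)); /* 0 and nonzero */
        printf("1/pz, 1/nz    : %f %f\n", 1.0f / pz, 1.0f / nz);     /* inf -inf */
        return 0;
    }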

VonC
  • 9
    I know that this from your source, but I don't like the sentence: "The term double precision is something of a misnomer because the precision is not really double." Single and Double precision these days are pretty universally defined by IEEE, and as you point out single precision has 23 bits in the fraction and double has 52 bits--that is basically double the precision... – Carl Walsh Jul 20 '12 at 20:34
  • 5
    @ZeroDivide '`**`' is **[Exponentiation](http://en.wikipedia.org/wiki/Exponentiation)** – VonC Aug 28 '13 at 05:23
  • 13
    @CarlWalsh 52/23 != 2 ergo it is not "double the precision" – rfoo Sep 28 '13 at 14:37
  • @johnson You have more details about unnormalized values in http://www.easy68k.com/paulrsm/6502/WOZFPPAK.TXT, and also in https://stackoverflow.com/a/28801033/6309 – VonC Dec 09 '17 at 07:21
  • 3
    @rfoo If you want to be pedantic sure, it's not *exactly* double, but 52/2 > 23 so yes, it is double the precision, it's just double and then some more. – JShorthouse Nov 08 '19 at 14:14
  • The source link for the second quote is dead, which is probably just as well since it seems not very well written. In the context of that quote, the sentence, "The extra bits increase not only the precision but also the range of magnitudes that can be represented," seems to imply that we cannot simultaneously double the precision and increase the range, whereas in fact we have _more_ than doubled the precision. I'm sure the author knew this, and probably what they meant by "misnomer" is that "double-precision" understates the actual improvement in the representation. – David K Nov 27 '19 at 13:38
  • @DavidK Thank you. I have restored the link. The passage you mention is from https://www.webopedia.com/TERM/D/double_precision.html, which is still up. You can edit the answer to add/include your comment. – VonC Nov 27 '19 at 14:02
  • On further thought the Webopedia entry isn't as bad as I said. It's a worthwhile link. At your invitation, I added one sentence underneath it to clarify the relationship of the two IEEE formats. I'm not fixated on the exact wording. – David K Nov 27 '19 at 21:56
  • @DavidK Thank you. Don't hesitate to revisit this answer if you have any new element to add. – VonC Nov 27 '19 at 22:44
  • Would the F (Fraction) be the same as the Mantissa? – theMyth Mar 30 '23 at 08:25
  • @theMyth Yes, as [stated below](https://stackoverflow.com/a/42444685/6309): The single precision IEEE FPS format is composed of 32 bits, divided into a 23 bit mantissa, `M`, an 8 bit exponent, `E`, and a sign bit, `S`. The **normalized mantissa**, `m`, is stored in bits 0-22 with the hidden bit omitted. Thus `M = m-1`. – VonC Mar 30 '23 at 09:51
60

I have read a lot of answers, but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a university professor I had some years ago.

Recalling the style of VonC's answer, a single precision floating point representation uses a 32-bit word.

  • 1 bit for the sign, S
  • 8 bits for the exponent, 'E'
  • 24 bits for the fraction, also called mantissa, or coefficient (even though just 23 are represented). Let's call it 'M' (for mantissa; I prefer this name, as "fraction" can be misunderstood).

Representation:

          S  EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
bits:    31 30      23 22                     0

(Just to point out, the sign bit is the last, not the first.)

A double precision floating point representation uses a 64-bit word.

  • 1 bit for the sign, S
  • 11 bits for the exponent, 'E'
  • 53 bits for the fraction / mantissa / coefficient (even though only 52 are represented), 'M'

Representation:

           S  EEEEEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
bits:     63 62         52 51                                                  0

As you may notice, I wrote that the mantissa has, in both types, one bit more of information than its stored representation. In fact, the mantissa is a number written without its leading non-significant zeros. For example,

  • 0.000124 becomes 0.124 × 10⁻³
  • 237.141 becomes 0.237141 × 10³

This means that the mantissa will always be in the form

0.α₁α₂…αₜ × βᵖ

where β is the base of representation. But since the fraction is a binary number, α₁ will always be equal to 1, thus the fraction can be rewritten as 1.α₂α₃…αₜ₊₁ × 2ᵖ and the initial 1 can be implicitly assumed, making room for an extra bit (αₜ₊₁).
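
This implicit bit is why C's float.h reports one more significand bit than the 23/52 that are physically stored; a quick check of my own:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Only 23 (float) / 52 (double) fraction bits are stored, but the
           implicit leading 1 gives one extra bit of significand: */
        printf("float  significand bits: %d\n", FLT_MANT_DIG);  /* 24 */
        printf("double significand bits: %d\n", DBL_MANT_DIG);  /* 53 */
        return 0;
    }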

Now, it's obviously true that the double of 32 is 64, but that's not where the word comes from.

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

With that said, it's easy to estimate the number of decimal digits which can be safely used:

  • single precision: log₁₀(2²⁴), which is about 7~8 decimal digits
  • double precision: log₁₀(2⁵³), which is about 15~16 decimal digits
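
Those estimates match what C's float.h reports; a quick sketch of my own (FLT_DIG and DBL_DIG are the guaranteed, slightly conservative counts):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("FLT_DIG = %d, DBL_DIG = %d\n", FLT_DIG, DBL_DIG); /* 6, 15 */

        /* The estimate in action: the float keeps only ~7 correct digits. */
        float  f = 123.456789f;
        double d = 123.456789;
        printf("float : %.9f\n", f);  /* 123.456787109 */
        printf("double: %.9f\n", d);  /* 123.456789000 */
        return 0;
    }
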
Alessandro
  • 3
    Thanks for using the correct bit numbering (the sign being the 31st and 63rd bit, respectively). – Jack_Hu Aug 29 '20 at 10:55
  • I'm not sure why you claim it is specifically *decimal* precision that is the relevant measure. The precision ratio is independent of numeric base, but it is easiest to draw it directly from the binary precisions: 53 bits / 24 bits. (The same ratio can be drawn from your base-10 logarithms, though). binary64 has a little *more* than double the precision of binary32 (plus a wider exponent range), so "double precision" isn't so bad description, but it isn't exactly correct, either. – John Bollinger Mar 22 '23 at 22:24
  • Also, it's completely reasonable to observe that numeric precision is a different thing from storage size, but I think that storage size actually *is* how the term "double precision" arose, even though that makes the term a little less, um, precise. It was in use at least as far back as the early days of Fortran, when floating-point formats were much more varied than they are today, long before IEEE 754 was drafted. And taking Fortran 77 as an example, the storage size of a `DOUBLE PRECISION` object is explicitly specified to be twice that of a `REAL` or `INTEGER`, format notwithstanding. – John Bollinger Mar 22 '23 at 22:38
21

Okay, the basic difference at the machine level is that double precision uses twice as many bits as single. In the usual implementation, that's 32 bits for single and 64 bits for double.

But what does that mean? If we assume the IEEE standard, then a single precision number has 23 bits of mantissa and a maximum decimal exponent of about 38; a double precision number has 52 bits of mantissa and a maximum decimal exponent of about 308.

The details are at Wikipedia, as usual.
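
For the ranges mentioned above, a one-liner sketch of my own against C's float.h:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* The maximum finite values match the exponents quoted above. */
        printf("FLT_MAX = %e\n", FLT_MAX);  /* about 3.402823e+38  */
        printf("DBL_MAX = %e\n", DBL_MAX);  /* about 1.797693e+308 */
        return 0;
    }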

Charlie Martin
14

To add to all the wonderful answers here:

First of all, float and double are both used for the representation of fractional numbers. So, the difference between the two stems from how much precision they can store the numbers with.

For example: I have to store 123.456789. One type may be able to store only 123.4567, while the other may be able to store the exact 123.456789.

So, basically we want to know how accurately a number can be stored, and that is what we call precision.

Quoting @Alessandro here

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

Float can accurately store about 7-8 significant digits, while double can accurately store about 15-16 significant digits.

So, double can store roughly twice as many digits as float. That is why double is called double the float.
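
A tiny C sketch of my own makes the difference visible by accumulating a value that has no exact binary representation:

    #include <stdio.h>

    int main(void) {
        float  fsum = 0.0f;
        double dsum = 0.0;

        /* 0.1 has no exact binary representation, so every addition rounds;
           double's extra significand bits keep the error far smaller. */
        for (int i = 0; i < 1000000; i++) {
            fsum += 0.1f;
            dsum += 0.1;
        }
        printf("float  sum: %f\n", fsum); /* drifts visibly away from 100000 */
        printf("double sum: %f\n", dsum); /* approximately 100000.000001 */
        return 0;
    }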

SimpleGuy
9

Everything has been explained here in great detail, and there is nothing I could add further. Though I would like to explain it in layman's terms, or plain English:

1.9 is less precise than 1.99
1.99 is less precise than 1.999
1.999 is less precise than 1.9999

.....

A variable able to store or represent "1.9" provides less precision than one able to hold or represent 1.9999. These fractions can amount to a huge difference in large calculations.

Asad
8

Basically, single precision floating point arithmetic deals with 32-bit floating point numbers, whereas double precision deals with 64-bit numbers.

The larger number of bits in double precision increases both the maximum value that can be stored and the precision (i.e. the number of significant digits).

cletus
7

As to the question "Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use, are the double precision capabilities made use of (if they exist)?":

I believe that both platforms are incapable of double precision floating point. The original Cell processor only had 32-bit floats, the same as the ATI hardware on which the Xbox 360 is based (R600). The Cell got double precision floating point support later on, but I'm pretty sure the PS3 doesn't use that chippery.

codekaizen
2

Double precision means the numbers take twice the word-length to store. On a 32-bit processor, the words are all 32 bits, so doubles are 64 bits. What this means in terms of performance is that operations on double precision numbers take a little longer to execute. So you get a better range and precision, but there is a small hit on performance. This hit is mitigated a little by hardware floating point units, but it's still there.
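
How big that hit is depends heavily on the hardware and compiler. A rough micro-benchmark sketch of my own (treat the numbers as illustrative only; optimization flags and vectorization dominate):

    #include <stdio.h>
    #include <time.h>

    static float bench_float(long n) {
        float acc = 1.0f;
        for (long i = 0; i < n; i++)
            acc = acc * 1.0000001f + 0.0000001f;  /* dependent multiply-add chain */
        return acc;
    }

    static double bench_double(long n) {
        double acc = 1.0;
        for (long i = 0; i < n; i++)
            acc = acc * 1.0000001 + 0.0000001;
        return acc;
    }

    int main(void) {
        long n = 100000000L;   /* 10^8 iterations */
        clock_t t0 = clock();
        volatile float  f = bench_float(n);
        clock_t t1 = clock();
        volatile double d = bench_double(n);
        clock_t t2 = clock();
        (void)f; (void)d;
        printf("float : %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("double: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }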

The N64 used a MIPS R4300i-based NEC VR4300, which is a 64-bit processor, but the processor communicates with the rest of the system over a 32-bit-wide bus. So, most developers used 32-bit numbers because they are faster, and most games at the time did not need the additional precision (so they used floats, not doubles).

All three systems can do single and double precision floating point operations, but they might not because of performance. (Although pretty much everything after the N64 used a 32-bit bus, so...)

Alex
0

According to IEEE 754, the standard for floating point storage:

  • 32- and 64-bit formats (single precision and double precision)
  • 8- and 11-bit exponents, respectively
  • Extended formats (both mantissa and exponent) for intermediate results
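
Whether an extended format is actually available can be checked from C. A small sketch of my own (results vary by platform; long double is the 80-bit x87 extended format on many x86 toolchains, but just a 64-bit double elsewhere):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("sizeof(float)       = %zu\n", sizeof(float));       /* 4  */
        printf("sizeof(double)      = %zu\n", sizeof(double));      /* 8  */
        printf("sizeof(long double) = %zu\n", sizeof(long double)); /* 16, 12, or 8 */
        printf("LDBL_MANT_DIG       = %d\n",  LDBL_MANT_DIG);       /* 64 on x87 */
        return 0;
    }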

-3

A single precision number uses 32 bits, with the MSB being the sign bit, whereas a double precision number uses 64 bits, with the MSB likewise being the sign bit.

Single precision

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF (sign + exponent + significand)

Double precision:

S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF (sign + exponent + significand)

Steve Bennett