Questions tagged [single-precision]

57 questions
38
votes
6 answers

Building a 32-bit float out of its 4 composite bytes

I'm trying to build a 32-bit float out of its 4 composite bytes. Is there a better (or more portable) way to do this than with the following method? #include typedef unsigned char uchar; float bytesToFloat(uchar b0, uchar b1, uchar b2,…
Madgeek
  • 383
  • 1
  • 3
  • 6
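One commonly suggested portable approach to the question above is to assemble the bytes into a 32-bit integer with shifts and then copy the bit pattern into a float with `memcpy`. The sketch below is an illustration, not the asker's code: the name `bytes_to_float` and the choice of `b0` as the most significant byte are assumptions, since the excerpt is truncated before the byte order is shown. It also assumes `float` is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE 754 single on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: b0 is taken as the most significant byte. */
float bytes_to_float(unsigned char b0, unsigned char b1,
                     unsigned char b2, unsigned char b3)
{
    /* Build the bit pattern with shifts, which does not depend on host endianness. */
    uint32_t bits = ((uint32_t)b0 << 24) | ((uint32_t)b1 << 16) |
                    ((uint32_t)b2 << 8)  |  (uint32_t)b3;

    /* memcpy copies the representation without violating aliasing rules. */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, `bytes_to_float(0x3F, 0x80, 0x00, 0x00)` yields `1.0f`, because `0x3F800000` is the single-precision encoding of 1.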
13
votes
5 answers

How does C know what type to expect?

If all values are nothing more than one or more bytes, and no byte can contain metadata, how does the system keep track of what sort of number a byte represents? Looking into Two's Complement and Single Point on Wikipedia reveals how these numbers…
Jack Stout
  • 1,265
  • 3
  • 12
  • 25
10
votes
4 answers

Why is a float "single precision"?

I'm curious as to why the IEEE calls a 32-bit floating-point number single precision. Was it just a means of standardization, or does 'single' actually refer to a single 'something'? Is it simply a standardized level? As in, precision level 1…
Keith Grout
  • 899
  • 11
  • 30
8
votes
1 answer

CUDA C using single precision flop on doubles

The problem: During a project in CUDA C, I came across unexpected behaviour regarding single precision and double precision floating point operations. In the project, I first fill an array with numbers in one kernel, and in another kernel I do some…
Frank
  • 362
  • 3
  • 13
7
votes
1 answer

Single-precision arithmetic broken when running x86-compiled code on a 64-bit machine

When you read MSDN on System.Single: Single complies with the IEC 60559:1989 (IEEE 754) standard for binary floating-point arithmetic. and the C# Language Specification: The float and double types are represented using the 32-bit…
Jeppe Stig Nielsen
  • 60,409
  • 11
  • 110
  • 181
5
votes
2 answers

Single precision argument reduction for trigonometric functions in C

I have implemented some approximations for trigonometric functions (sin,cos,arctan) computed with single precision (32 bit floating point) in C. They are accurate to about +/- 2 ulp. My target device does not support any or methods.…
4
votes
2 answers

Approximating cosine on [0,pi] using only single precision floating point

I'm currently working on an approximation of the cosine. Since the final target device is a self-developed design with a 32-bit floating point ALU / LU and a specialized C compiler, I am not able to use the C library math functions…
4
votes
3 answers

How to keep precision on int64_t = int64_t * float?

I would like to apply a correction to an int64_t by a factor in the range [0.01..1.2], with a precision of about 0.01. The naive implementation would be: int64_t apply_correction(int64_t y, float32_t factor) { return y *…
nowox
  • 25,978
  • 39
  • 143
  • 293
4
votes
2 answers

Why IEEE754 single-precision float has only 7 digit precision?

Why does a single-precision floating point number have 7 digits of precision (or a double 15-16 digits of precision)? Can anyone please explain how we arrive at that based on the 32 bits assigned for float (Sign (31), Exponent (30-23), Fraction (22-0))?
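The usual back-of-the-envelope estimate for this counts the significand bits: a single-precision value has 23 stored fraction bits plus one implicit leading bit, so 24 bits in total, and a double has 53. Converting bits to decimal digits gives roughly:

```latex
\log_{10}\!\left(2^{24}\right) = 24 \log_{10} 2 \approx 24 \times 0.30103 \approx 7.22
\qquad
\log_{10}\!\left(2^{53}\right) = 53 \log_{10} 2 \approx 53 \times 0.30103 \approx 15.95
```

This is only a rough figure for how many decimal digits the significand can carry; it is not a guarantee that every 7-digit decimal survives a round trip through a float.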
3
votes
1 answer

Does accessing the 4 bytes of a float break C++ aliasing rules

I need to read the binary content of a file and turn the extracted bytes into single precision floating point numbers. How to do this has already been asked here. That question does have proper answers but I'm wondering whether a particular answer…
ackh
  • 1,648
  • 2
  • 18
  • 36
3
votes
4 answers

How do I print the exact value stored in a float?

If I assign the value 0.1 to a float: float f = 0.1; The actual value stored in memory is not an exact representation of 0.1, because 0.1 is not a number that can be exactly represented in single-precision floating-point format. The actual value…
Hammerite
  • 21,755
  • 6
  • 70
  • 91
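One way to see the stored value, sketched here under stated assumptions, is to ask `printf` for enough fractional digits to cover any finite float (149 suffice, since the smallest subnormal is 2^-149) and then trim the trailing zeros. The helper name `format_float_exact` is hypothetical. The C standard does not require a library to print digits exactly beyond a certain count, so this relies on an implementation such as glibc that does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the decimal expansion of f into buf and
 * returns its length, or -1 if the buffer is too small.
 * Assumes the C library prints exact digits (glibc does). */
int format_float_exact(float f, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    /* 149 fractional digits cover every finite single-precision value. */
    int n = snprintf(buf, size, "%.149f", (double)f);
    if (n < 0 || (size_t)n >= size)
        return -1;

    /* Trim trailing zeros, then a dangling decimal point. */
    while (n > 0 && buf[n - 1] == '0')
        --n;
    if (n > 0 && buf[n - 1] == '.')
        --n;
    buf[n] = '\0';
    return n;
}
```

With a 256-byte buffer this has room for any finite float. On a library with exact printing, passing `0.1f` produces a string that begins `0.1000000014901161`, which shows the stored value is slightly above 0.1.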
3
votes
1 answer

After converting bits to Double, how to store actual float/double value without using BigDecimal?

According to several floating point calculators, as well as my code below, the following 32 bits 00111111010000000100000110001001 have an actual floating point value of (0.750999987125396728515625). Since it is the actual float value, I should…
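The bit pattern quoted above can be checked independently. The question itself concerns Java; the sketch below uses C purely as an illustration, and the helper name `bits_to_float` is hypothetical. It assumes a 32-bit IEEE 754 `float` and a string of exactly 32 '0'/'1' characters, most significant bit first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE 754 single on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: bits must point to at least 32 characters,
 * each '0' or '1', most significant bit first. */
float bits_to_float(const char *bits)
{
    uint32_t word = 0;
    for (int i = 0; i < 32; ++i)
        word = (word << 1) | (uint32_t)(bits[i] == '1');

    /* Reinterpret the integer's bit pattern as a float. */
    float f;
    memcpy(&f, &word, sizeof f);
    return f;
}
```

The quoted pattern is `0x3F404189` in hexadecimal, and the resulting float compares equal to the literal `0.751f`, the nearest single-precision value to 0.751.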
3
votes
1 answer

C - adding two single-precision floating point normal numbers, can't get result to infinity

I'm playing around with floating-point arithmetic, and I encountered something which needs explaining. When setting rounding mode to 'towards zero', aka: fesetround(FE_TOWARDZERO); And adding different kind of normal positive numbers, I can never…
Shay Golan
  • 89
  • 7
3
votes
2 answers

subtracting double precision from single precision gives me 0. not what I want

I am trying to examine the round-off error associated with sin(x). Using Octave I get these numbers: >> single(sin(10)) ans = -0.544021129608154 >> sin(10) ans = -0.544021110889370 >> (single(sin(10))) - (sin(10)) ans = 0 which should be :…
user35053
  • 33
  • 6
2
votes
2 answers

Double vs Float vs _Float16 (Running Time)

I have a simple question about the C language. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but its running time is not much faster than the single or double-precision versions. I tested half, single, double with a…