
I'm having problems with large float values. I'm taking the l2-norm of some vectors and running into trouble when the components are large. For example, consider vec as a vector:

float vec[] = { 10001.000000, 10002.000000, 10000.000000, 10003.000000,
        10003.000000, 10002.000000, 10003.000000 };
float sumzz = 0;
for (int i = 0; i < 7; i++) {
    sumzz += pow(vec[i], 2);   /* pow() comes from <math.h> */
}
printf("%.0f\n", sumzz);

The output is '700280064', and it's wrong because the correct value is '700280036'.
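For reference, here is a minimal sketch (added for illustration, not code from the post) that shows what is going on: accumulated in double the total is exactly 700280036, but floats around 700 million are spaced 64 apart, so the closest value a float can hold is 700280064.

#include <stdio.h>

int main(void) {
    float vec[] = { 10001.0f, 10002.0f, 10000.0f, 10003.0f,
                    10003.0f, 10002.0f, 10003.0f };
    double sum = 0.0;

    /* in double, every square and the running total fit exactly */
    for (int i = 0; i < 7; i++)
        sum += (double) vec[i] * vec[i];

    printf("exact sum    : %.0f\n", sum);          /* 700280036 */
    /* floats near 7e8 are 64 apart; 700280036 itself is not representable */
    printf("nearest float: %.0f\n", (float) sum);  /* 700280064 */
    return 0;
}

In other words, the printed 700280064 is simply the nearest number a float can represent to the exact total; it needs more than float's 24 significand bits.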

So I tried some things and found that when I cast a large value to float it loses precision. Another example:

long num = 5502160332;
printf("%ld\n", num);
printf("%f\n", (float) num);

The output of the first print is clearly 5502160332, while the second is 5502160384. Am I doing something wrong? Is there a solution to this?
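Again a small sketch (added here, not from the post) to make the rounding visible: 5502160332 has more significant bits than float's 24-bit significand can hold, and between 2^32 and 2^33 consecutive floats are 512 apart, so the cast snaps the value to the nearest multiple of 512, which is 5502160384.

#include <stdio.h>
#include <math.h>

int main(void) {
    long  num = 5502160332;     /* too many significant bits for a float */
    float f   = (float) num;    /* rounded to a 24-bit significand       */

    printf("original        : %ld\n", num);               /* 5502160332 */
    printf("as float        : %.0f\n", f);                /* 5502160384 */
    printf("next float down : %.0f\n", nextafterf(f, 0)); /* 5502159872, 512 below */
    return 0;
}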

EDIT: as I mentioned in a comment, the problem is that I should use as few double values as possible, because I'm working with CUDA, and except on Tesla or high-end Quadro cards double arithmetic has 1/32 the throughput of float.

tucossss
  • Just fyi, `%f` in `printf` expects a `double`, which, whether you knew it or not, is what you got after variadic argument promotion, but it's quantized first via your float cast. Try a cast to full double from inception in your test, eg. `printf("%f\n", (double) num);`. The results should be closer to your expectations. – WhozCraig Nov 14 '19 at 11:43
  • 7 significant decimal digits is about as good as it gets with `float`. Please use `double` for floating point except when there is very good reason not to. Please also read [Is floating point math broken?](http://stackoverflow.com/questions/588004/is-floating-point-math-broken) and [Why Are Floating Point Numbers Inaccurate?](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate) In floating point arithmetic there is a trade-off between *range* and *precision*. – Weather Vane Nov 14 '19 at 11:43
  • Yes, I know that all this works with double values; the problem is I can't use those because I'm working with CUDA, and unless you have a Tesla you can't handle double values efficiently. – tucossss Nov 14 '19 at 11:47
  • The most recent GPUs, such as the GTX 280s in barracuda04 and barracuda10, do support double-precision. However, by default the CUDA compiler does not use double-precision arithmetic. – Weather Vane Nov 14 '19 at 11:49
  • Yes, but you get 1/32 of the GFLOPS using 'double' values. – tucossss Nov 14 '19 at 11:51
  • Why do you need floating point at all? You don't need `pow` to square a value. – Weather Vane Nov 14 '19 at 11:52
  • Because I'm using CUBLAS and it works with float values. – tucossss Nov 14 '19 at 11:55
  • What is the consequence of the inaccuracy? – Weather Vane Nov 14 '19 at 11:57
  • Well, the first step is to get the l2-norm of a set of vectors. Those norms are fed into some matrix-matrix multiplications with CUBLAS, and the resulting matrices have a small percentage of values that are completely wrong (NaN, or zeros where they cannot be zero). I tried a few things and found that the problem was that the l2-norm of vectors with large values was simply wrong. – tucossss Nov 14 '19 at 11:59
  • Can't you do that part outside of Cuda? – Weather Vane Nov 14 '19 at 12:03
  • Nope, that's the core of the application. Speedup from using CUDA is huge. – tucossss Nov 14 '19 at 12:06
  • What's wrong with `pow(vec[i], 2)` -> `vec[i] * vec[i]`? This is probably even faster, but the accuracy will probably be the same... – Jabberwocky Nov 14 '19 at 12:41
  • @Jabberwocky The accuracy will quite possibly be *better*, because now the compiler can optimize the accumulation using FMA (fused multiply-add). – njuffa Nov 14 '19 at 17:06
  • What range of values does your program have to work with and how many such values are you at most going to accumulate? It's pointless to talk and worry about working with larger precision without first knowing why the smaller precision is not sufficient and knowing that the larger precision will be sufficient… – Michael Kenzel Nov 16 '19 at 11:38
  • What does your *actual* CUDA code look like? How do you *actually* perform the accumulation in CUDA? I don't see any parallelization in the example code above. If you do not parallelize, you will almost certainly be better off not running this on the GPU at all… – Michael Kenzel Nov 16 '19 at 11:43
  • That is not the CUDA code, it was an example. – tucossss Nov 25 '19 at 20:02
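Picking up the pow-versus-multiply and FMA comments above, here is a sketch of what that accumulation could look like with an explicit fused multiply-add (the helper name sum_of_squares_fma is made up for illustration). fmaf(a, b, c) computes a*b + c with a single rounding, and on current GPUs it maps to a hardware FMA instruction.

#include <math.h>

/* square-and-accumulate with one rounding per term instead of two */
float sum_of_squares_fma(const float *v, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum = fmaf(v[i], v[i], sum);
    return sum;
}

Writing plain vec[i] * vec[i] and letting the compiler contract it to an FMA has the same effect; either way it avoids calling pow just to square a value.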

1 Answer


If you insist on using floats you have no choice but to accept the limited accuracy.

But since the limited accuracy makes your program fail, giving NaN and 0 entries in your later matrices, there is simply nothing to do but use double. And even double has limits; they are just a bit larger.

In this case your choice is 1/32 the speed or no result at all, I'm afraid. Or look for a different algorithm to construct your matrix that is less susceptible to inaccuracies.
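One well-known example of such an algorithm, added here purely as an illustration (it is not part of the original answer), is compensated (Kahan) summation: it stays entirely in float and keeps the accumulation error from growing with the number of terms, although the stored result is still limited to float's roughly 7 significant decimal digits. The helper name kahan_sum_of_squares is made up.

/* compensated (Kahan) summation of squares, entirely in float */
float kahan_sum_of_squares(const float *v, int n) {
    float sum = 0.0f;
    float c   = 0.0f;               /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = v[i] * v[i] - c;  /* re-inject the part lost last time      */
        float t = sum + y;          /* low-order bits of y are lost here...   */
        c = (t - sum) - y;          /* ...and captured back into c            */
        sum = t;
    }
    return sum;
}

Note that this only works if the compiler is not allowed to reassociate floating-point expressions (e.g. no -ffast-math).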

PS: You can keep your vectors in floats, then cast to double to compute the matrix and cast the result back to float. That way everything before and after the accuracy-critical step can remain fast.
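A sketch of what that PS could look like for the l2-norm itself (the helper name l2_norm is made up here): storage stays in float, only the accuracy-critical accumulation runs in double, and the result is narrowed back to float at the end.

#include <math.h>

float l2_norm(const float *v, int n) {
    double sum = 0.0;                 /* accuracy-critical step in double */
    for (int i = 0; i < n; i++)
        sum += (double) v[i] * v[i];  /* each product is exact in double  */
    return (float) sqrt(sum);         /* narrow back to float at the end  */
}

The inputs and the returned norm are still plain floats, so everything upstream and downstream keeps working in single precision.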

Goswin von Brederlow
  • Well, yeah, that's all. In fact I tried using cublasDgemm (M-M multiplication with 'double' values) instead of cublasSgemm (the same but with 'float' values) and the results are quite good. I lose between 10 and 20% of the speedup, but it's still huge compared to CPU-only code, so I think CUBLAS checks the data passed to it and doesn't work with 'double' if it is not strictly necessary. – tucossss Nov 14 '19 at 16:33
  • " so i think CUBLAS checks data passed to it and doesn't work with 'double' if is not strictly necessary." Nothing like that happens. If you call `cublasDgemm`, I assure you the calculations are carried out using FP64 arithmetic. It's possible that this particular gemm call is not a major performance limiter in your code, due to other aspects of your code (other work being done). – Robert Crovella Nov 15 '19 at 13:48
  • I tested only the CUBLAS part of the program with double vs. float and the results are similar: using double I lose nearly 30-50% in performance. I can't explain the reason for that difference. – tucossss Nov 25 '19 at 20:02