2

I am trying to perform summation of an array.

There are 1024 elements in that array and on applying the command "sum(a(:))", where a(:) is the array, I get the answer 1981.9072.

If I do the same summation of 1024 elements in Excel sheet the answer is 1981.93530 which is the right answer. So, a difference of 0.0281 is observed between the above two values. As I increase the number of elements in a(:) the difference in value obtained from "sum()" and Excel sheet increases.

I think the "sum()" value is different due to rounding off error. How do I get the true value (Excel value) using "sum()" without any rounding off error ?

K. Jadhav
  • 21
  • 1

1 Answers1

1

REAL(4) is a 32-bit IEEE-754 floating-point number which is imprecise.

You absolutely need to read and understand this important QA on the matter: Why are floating point numbers inaccurate?

Here's a page on the same topic from the GCC Fortran compiler docs: https://gcc.gnu.org/onlinedocs/gcc-3.3.6/g77/Floating_002dpoint-Errors.html

You need to change the type of a to REAL(8) - but note that even if you do, the sum still won't be completely accurate (it will just be less inaccurate).


This can be reproduced consistently in any language that implements IEEE-754 32-bit floating-point numbers, like C# (I'm using C# as I don't have access to a Fortran90 compiler right now):

static void Main()
{
    Single t = 0f;

    for( Int32 i = 0; i < 1024; i++ )
    {
        t += 1.93548369f; // This is a 32-bit float literal
    }

    Console.WriteLine( t ); // "1981.9072"
}

If you sum the numbers using a double-precision (64-bit) floating number type you get the same result as Excel:

static void Main()
{
    Double d = 0d;

    for( Int32 i = 0; i < 1024; i++ )
    {
        d += 1.93548369d; // This is a 64-bit float literal
    }

    Console.WriteLine( d ); // "1981.935302734375" --> "1981.9350"
}

If you want an accurate and precise answer then you need to use a data-type that doesn't use IEEE-754 - I'm not a Fortran user so I don't know what/if Fortran's equivalent of C#'s Decimal is, but when I do the calculation with Decimal I get 1981.93529856 which is different to Excel's answer (this is because Excel uses 64-bit IEEE-754 instead of a real Decimal type):

static void Main()
{
    Decimal dec = 0M;
    
    for( Int32 i = 0; i < 1024; i++ )
    {
        dec += 1.93548369M; // This is a decimal literal, note the `M`.
    }

    Console.WriteLine( dec ); // "1981.93529856" --> "1981.9352"
}
Dai
  • 141,631
  • 28
  • 261
  • 374
  • 3
    REAL(4) *might* be 32-bit IEEE-754 floating-point or could be anything else. It does not have to exist at all. REAL(8) *might* be 64-bit IEEE-754 floating-point or could be anything else. It does not have to exist at all. https://stackoverflow.com/questions/838310/fortran-90-kind-parameter I strongly suggest **NOT** to use magic numbers like 4 or 8 for kind numbers. – Vladimir F Героям слава Jan 29 '21 at 08:44
  • 3
    Also be aware that your link goes to the manual of g77 that wasn't even a Fortran 90 compiler and, in particular, kinds 4 and 8 did **NOT** exist there. g77 used kind numbers 1 and 2 for these (as you can easily see yourself in your link). – Vladimir F Героям слава Jan 29 '21 at 08:45
  • It would be very good if you could edit your answer to reflect the information Vladimit has provided – Ian Bush Jan 29 '21 at 09:14
  • @IanBush I am not a Fortran expert so I'm wary of being even more incorrect - both you and Vladimir have sufficient SO reputation to edit my answer on my behalf. – Dai Jan 29 '21 at 09:26
  • The `sum` intrinsic is most often implemented via the naive and grade-school algorithm of well simply adding the individual elements up without consideration of round-off error. Why? Because this is fast. If you want a possibly more accurate summation,then search for **Kahan summation** or **summation with carry**. If you implement this algorithm you absolutely want to avoid compiler options that cause the compiler to violate Fortran's expression evaluation rules. – steve Jan 29 '21 at 18:36