#include <iostream>
#include <iomanip>
#include <cstdlib>

#define N 10000000

inline double frand() { return rand() / (double)RAND_MAX; }

// Split arr[0..n) into n_block equal blocks (n must be divisible by
// n_block), sum each block separately, then add up the partial sums.
double blockSum(double *arr, int n, int n_block) {
  double *sum_block = new double[n_block];
  double  sum = 0.0;

  for (int i=0; i<n_block; i++) {
    sum_block[i] = 0.0;
    for (int j=0; j<n/n_block; j++) {
      sum_block[i] += arr[n/n_block * i + j];
    }
  }

  for (int i=0; i<n_block; i++) {
    sum += sum_block[i];
  }

  delete [] sum_block;

  return sum;
}

int main() {
  double *darr = new double[N];
  srand(1973);
  for (int i=0; i<N; i++) {
    darr[i]   = frand();
  }

  double sum_tot = 0.0;
  for (int i=0; i<N; i++) {
    sum_tot += darr[i];
  }

  std::cout << "sum_tot              " << std::fixed << std::setprecision(10) << sum_tot << std::endl;

  double sum_block_tot;
  sum_block_tot = blockSum(darr, N, 1);
  std::cout << "sum_block_tot (1)    " << std::fixed << std::setprecision(10) << sum_block_tot << std::endl;

  sum_block_tot = blockSum(darr, N, 10);
  std::cout << "sum_block_tot (10)   " << std::fixed << std::setprecision(10) << sum_block_tot << std::endl;

  sum_block_tot = blockSum(darr, N, 100);
  std::cout << "sum_block_tot (100)  " << std::fixed << std::setprecision(10) << sum_block_tot << std::endl;

  sum_block_tot = blockSum(darr, N, 1000);
  std::cout << "sum_block_tot (1000) " << std::fixed << std::setprecision(10) << sum_block_tot << std::endl;

  delete [] darr;

  return 0;
}

Output:

sum_tot              4998539.4213524628
sum_block_tot (1)    4998539.4213524628
sum_block_tot (10)   4998539.4213524340
sum_block_tot (100)  4998539.4213524330
sum_block_tot (1000) 4998539.4213524330

In the example above, the total sum of darr takes a different value depending on how the initial array is chunked and how the partial block sums are combined into the total.

I am not asking why this happens - it is obviously related to floating-point rounding. I want to know whether there is any definitive way to mitigate (or even remove) this kind of dependency.

Especially in parallelized computations, there are many situations where we divide the original data into chunks, do some calculation on each chunk, and then reduce the results back to the main thread. It is not great if the computed values differ from run to run depending on the hardware structure.

And what if one is running a particle simulation with billions of particles, and the results differ depending on the chunk size or the order of summation?

There are some suggestions for reducing the loss of digits, but most of them are not applicable to HPC applications - we simply cannot sort the values by magnitude and decide the order of summation every time.

Is there any good way to do "good summations" in a parallel computing context?

PS) I would also like to confirm that sum_block_tot (1) is the closest value to the real sum, and that the more blocks are used, the more the result generally degrades.

Sangjun Lee
1 Answer


If you can't sort the values, you can't improve the accuracy. a + b + c + d != (a + b) + (c + d) in floating point math.
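
For example (a small standalone demo, not taken from the question), the two groupings below produce different doubles because the small terms are absorbed differently:

#include <iostream>
#include <iomanip>

int main() {
  double a = 1e16, b = 1.0, c = 1.0;
  // Adding the 1.0s one at a time loses both of them to rounding;
  // adding them to each other first preserves their contribution.
  std::cout << std::setprecision(17)
            << ((a + b) + c) << "\n"   // 10000000000000000
            << (a + (b + c)) << "\n";  // 10000000000000002
  return 0;
}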

You can use the heap algorithms (std::make_heap / std::pop_heap) on each block instead of std::sort on the whole range. That can increase the accuracy of each block sum and of the sum of the blocks.
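
A minimal sketch of that idea (my own illustration, not the answerer's code; heapBlockSum is a made-up name, and it assumes smallest-first summation is acceptable for your data): keep the block in a min-heap, repeatedly pop the two smallest values, and push their sum back until one value remains.

#include <algorithm>
#include <functional>
#include <vector>

// Sum one block by always combining the two smallest remaining values,
// using the standard heap algorithms instead of sorting.
double heapBlockSum(std::vector<double> block) {
  if (block.empty()) return 0.0;
  std::make_heap(block.begin(), block.end(), std::greater<double>());   // min-heap
  while (block.size() > 1) {
    std::pop_heap(block.begin(), block.end(), std::greater<double>());  // smallest to back
    double a = block.back(); block.pop_back();
    std::pop_heap(block.begin(), block.end(), std::greater<double>());  // next smallest to back
    double b = block.back(); block.pop_back();
    block.push_back(a + b);                                             // partial sum re-enters the heap
    std::push_heap(block.begin(), block.end(), std::greater<double>());
  }
  return block.front();
}

Each thread can sum its own block this way, and the per-block results can be combined with the same routine; whether the extra heap overhead is worth it over a plain block sum depends on the application.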

273K
  • I got interesting results. I compared sum_tot with sorting the whole range and with making heaps of blocks: sum_tot had the worst accuracy, and the other sums were very close to each other. – 273K May 12 '23 at 06:44