
I am inverting a matrix via a Cholesky factorization in a distributed environment, as discussed here. My code works fine, but in order to verify that my distributed project produces correct results, I had to compare it against the serial version. The results are not exactly the same!

For example, the last five cells of the result matrix are:

serial gives:
-250207683.634793 -1353198687.861288 2816966067.598196 -144344843844.616425 323890119928.788757
distributed gives:
-250207683.634692 -1353198687.861386 2816966067.598891 -144344843844.617096 323890119928.788757

I posted about this in the Intel forum, but the answer I got was about getting the same results across all executions of the distributed version, something I already have. They seem (in another thread) to be unable to answer this:

How can I get the same results between serial and distributed execution? Is this possible? This would fix the arithmetic discrepancy.

I have tried setting mkl_cbwr_set(MKL_CBWR_AVX); and using mkl_malloc() to align memory, but nothing changed. I get identical results only when I spawn a single process for the distributed version (which makes it almost serial)!

The distributed routines I am calling: pdpotrf() and pdpotri().

The serial routines I am calling: dpotrf() and dpotri().

Glorfindel
gsamaras
  • Do you get the same results every time you run the serial version or are there small variances in the results between serial runs? – Matt Aug 18 '15 at 14:42
  • Sorry about adding multiple successive comments; I did some more research, and this is a somewhat common problem in HPC and distributed computing: many large problems and benchmarks include a step that verifies the validity of the obtained results. Even the [High Performance Linpack](http://www.netlib.org/benchmark/hpl/) (HPL) test used to benchmark the fastest computers in the world has a step for validating the answer that is calculated when solving the problem. There are several different techniques for validating an answer; if that is a route you would like to pursue, I can post an answer. – Matt Aug 18 '15 at 14:50
  • @Matt thanks for commenting; yes, the results are reproducible every time I run the serial/distributed version. The problem is that both versions solve the same problem with the same algorithms, but they output slightly different results, because the nature of distributed execution causes a different order of operations, resulting in minor differences. The question is how (if it is possible) I can make the distributed version give the same result as the serial version. Could you elaborate a bit on your second comment please? – gsamaras Aug 18 '15 at 14:58
  • One example of validating an answer would be calculating the percent error between the distributed result and the actual result, then setting a threshold for valid answers; for example, would a percent error between the two results of less than .0001% be considered a valid answer? – Matt Aug 18 '15 at 15:07
  • Yes @Matt that's what I thought, but I wanted to be sure. So, you are saying that there is nothing that I can do to achieve same results between serial and distributed executions, but I can use a threshold value for validating the distributed answer, right? – gsamaras Aug 18 '15 at 15:09
  • On an unrelated note on the serial version are you still using the MKL and if so do you know what the value of the `MKL_NUM_THREADS` environment variable is? – Matt Aug 18 '15 at 15:10
  • Essentially yes; getting the serial and distributed versions to produce exactly the same output would be difficult, and the overhead to set it up might even be greater than the time saved by running in parallel, at least on small data sets. – Matt Aug 18 '15 at 15:13
  • @Matt yes, I am using MKL in both cases. Not really, but I am spawning 4 processes, not threads. Would you like me to check it nevertheless? I see, but I would certainly like to see an answer of yours, which I feel will guide me on how to study the numerical error between serial and distributed versions. – gsamaras Aug 18 '15 at 15:13
  • It is not critical to check `MKL_NUM_THREADS`; I will need to review some documentation to check which portions of the MKL it has the biggest impact on, and its precedence when using MPI with MKL, though ultimately none of that would change the complications in getting the serial and distributed computations to produce exactly the same results. I will post an answer to compute the error between these two results shortly. – Matt Aug 18 '15 at 15:19

2 Answers


Your differences seem to appear at about the 12th significant figure (s.f.). Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, in general, apply operations in a deterministic order, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed, you may observe differences of the same order when running on a different number of processors, 4 vs 8, say.

Unfortunately, the easy way to get deterministic results is to stick to serial execution. Getting deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations, right down to the last + or *, which almost certainly rules out the use of most numeric libraries and leads to painstaking manual coding of large numeric routines.

In most cases that I've encountered, the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later s.f. I don't know what your numbers represent, but for many scientists and engineers, equality to the 4th or 5th s.f. is enough equality for all practical purposes. It's a different matter for mathematicians ...

gsamaras
High Performance Mark
  • So, Mark, there is basically nothing I can do with MKL, I will just have to accept that I will get that minor difference, right? – gsamaras Aug 18 '15 at 14:59
  • Yes, but don't blame MKL, this is a feature of easy parallel numerical computing. It's difficult to eliminate the differences, so difficult that it is not generally worth the effort. – High Performance Mark Aug 18 '15 at 15:08
  • I see Mark. So I should study the numerical error I am getting? Can you give any tip on that direction? – gsamaras Aug 18 '15 at 15:10
  • The analysis of the numerical errors arising from parallel computation is an extremely complex field. Either study it in depth, much more depth than would be possible to explain in an SO answer, or do what most people do and ignore the problem. Until your drill bit gets to 1000m below the mud line and you hit dust ... – High Performance Mark Aug 18 '15 at 15:14

As the other answer mentions, getting exactly the same results between serial and distributed execution is not guaranteed. One common technique with HPC/distributed workloads is to validate the solution. There are a number of techniques, from calculating relative error to more complex validation schemes, like the one used by the HPL. Here is a simple C++ function that calculates the relative error of each cell. As @HighPerformanceMark notes in his post, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of information available online about the topic.

#include <iostream>
#include <cmath>

// Relative error of the approximation x with respect to the reference value a.
double calc_error(double a, double x)
{
  return std::abs(x - a) / std::abs(a);
}

int main(void)
{
  double sans[] = {-250207683.634793, -1353198687.861288, 2816966067.598196, -144344843844.616425, 323890119928.788757};
  double pans[] = {-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};
  double err[5];

  std::cout << "Serial Answer,Distributed Answer, Error" << std::endl;
  for (int it = 0; it < 5; it++) {
    err[it] = calc_error(sans[it], pans[it]);
    std::cout << sans[it] << "," << pans[it] << "," << err[it] << "\n";
  }
  return 0;
}

Which produces this output:

Serial Answer,Distributed Answer, Error
-2.50208e+08,-2.50208e+08,4.03665e-13
-1.3532e+09,-1.3532e+09,7.24136e-14
2.81697e+09,2.81697e+09,2.46631e-13
-1.44345e+11,-1.44345e+11,4.65127e-15
3.2389e+11,3.2389e+11,0

As you can see, the error in every case is on the order of 10^-13 or less, and in one case non-existent. Depending on the problem you are trying to solve, error on this order of magnitude could be considered acceptable. Hopefully this helps illustrate one way of validating a distributed solution against a serial one, or at least gives one way to show how far apart the parallel and serial algorithms are.

When validating answers for big problems and parallel algorithms it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then look to see if the result and/or error varies with the parallel algorithm run or if it settles over time.

Showing that a parallel algorithm produces error within acceptable thresholds over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result.

In the past, when I have performed benchmark testing, I have noticed wildly varying behavior for the first several runs before the servers have "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.

Matt
  • Damn it Matt, such a good answer, and now I do not know which one to accept. The other answer addresses exactly my post, while yours suggests the next step, which should be the case. I mean, that operation really reminds me of stuff we were taught at university. So, I can do what your example illustrated; would it be nice to sum up every entry of the `err` array and divide it by 5, producing the average error? – gsamaras Aug 18 '15 at 16:01
  • That seems like it would give the average error, though in this case there are some visual tools that might be useful to show how close the two values are. CERN has developed a data processing and visualization kit that is great for huge data sets and has a lot of error calculation that is more advanced than this example. It is called [ROOT](https://root.cern.ch), and it is open-source and C++ based. I have used it extensively when looking at performance data, and while it is probably overkill in this case, it is a remarkably powerful tool that could be useful for analyzing your data. – Matt Aug 18 '15 at 16:08
  • I will admit that the other answer is more applicable to the problem that you posted @gsamaras, and my example is just a way of finding error between the two algorithms. There are a lot of ways this error detection can be used, in addition to the average error you can look at error values that are outside of thresholds you determine, as well as a variety of other outlier detection techniques that are more complex. If you would like I can add more to my answer or migrate it to a different question. Ultimately Mark posted a more direct answer that is probably more acceptable than this one – Matt Aug 18 '15 at 16:30
  • Hey Matt, sorry for my silence, I am studying the relevant book from the uni. Your example uses the absolute relative error, which is something I will definitely use! ROOT seems nice, but it's too much, as you said! Yes, yes please, I would like to see into this direction. Would you like me to post a new question, where you could entirely move your answer there and also expand it? – gsamaras Aug 18 '15 at 16:36
  • A new post might be useful; if you could post a larger data set, I can make an answer that is a bit more specific and uses data you are familiar with. I use ROOT for a lot of things, so I will include some visualizations and different outlier detection techniques. I can also try to get some cluster time to reproduce data using a similar environment, but that might take a bit longer. I have some things to take care of later today, but I will probably have time to post a more thorough answer later tonight or tomorrow morning. – Matt Aug 18 '15 at 16:42
  • Matt, I made a question (stackoverflow.com/questions/32078505/compute-numerical-error), if you want me to edit the question let me know. You can now delete this answer. For the time being I have accepted an answer, but be sure, that if you do post an answer, I will highly notice it (if you know what I mean!) ;) – gsamaras Aug 18 '15 at 17:23
  • I had some things to work on yesterday and wasn't able to start a good answer, but it looks like you got some pretty thorough responses. I am making an update to this answer to include some info about ways to validate an answer from a parallel algorithm. – Matt Aug 19 '15 at 15:48
  • Matt it's OK, I thought that you had a reason. I think it would be better to place the answer on the other question. However, this depends on your edit, so go ahead and decide where to post your answer, not for me, but for future users. – gsamaras Aug 19 '15 at 16:02
  • Matt, please let me know when you are done with your answer (you may already have, I just want to be sure). – gsamaras Aug 19 '15 at 17:25
  • Unfortunately I do not have time today to add a significant amount of detail. If you think it would be useful I can expand on how data from multiple runs can be used with histograms and fitting examples to show the statistical moments, but it would probably take a little while to get everything just exactly perfect, and I believe that there are already some more general examples of that sort of thing. – Matt Aug 19 '15 at 18:55
  • Matt I see. However, I still think it would be nice to move your answer to my other question, so that I can accept it, because now, I only gave you a +1. And among the other answers in my other question, I prefer yours. I leave that in your hands, if you want. Cheers+thanks again! – gsamaras Aug 20 '15 at 10:28