
Super EDIT:

Adding the broadcast step results in `ncols` being printed by the two processes on the master node (from which I can check the output). But why? I mean, all the variables that are broadcast already have a value on the line of their declaration!!!
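
For context, the broadcast step from the example looks roughly like this; a minimal, self-contained sketch in which the variable names (`nrows`, `ncols`) and the values set at declaration are stand-ins rather than the exact code:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank;
  int nrows = 2, ncols = 2;   /* already given a value at declaration, as in my code */

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  /* the broadcast step from the example: rank 0 sends the dimensions to every process */
  MPI_Bcast (&nrows, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast (&ncols, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf ("rank %d: nrows = %d, ncols = %d\n", rank, nrows, ncols);

  MPI_Finalize ();
  return 0;
}

Note that `MPI_Bcast` only copies the root's (rank 0's) values into the same variables on the other ranks, so if every rank already initializes them to the same value at declaration, the broadcast is redundant but harmless.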


I have some code based on this example.

I checked that the cluster configuration is OK with this simple program, which also printed the IP of the machine it was running on:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size;

  MPI_Init (&argc, &argv);      /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);        /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);        /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  // removed code that printed IP address
  MPI_Finalize();
  return 0;
}

which printed every machine's IP twice.


EDIT_2

If I print (only) the grid, as in the example, this is what I get for one computer:

Processes grid pattern:
0 1
2 3

and this for two computers, where nothing appears below the header:

Processes grid pattern:
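
For reference, a minimal sketch of the grid-printing step (the `if (myrow == r && mycol == c)` loop suggested in the comments below); the BLACS setup calls, the 2x2 grid size, and the names here are illustrative rather than the exact code:

#include <stdio.h>
#include <mpi.h>

/* C BLACS interface; with MKL these come from mkl_blacs.h, otherwise declare them by hand */
extern void Cblacs_pinfo (int *mypnum, int *nprocs);
extern void Cblacs_get (int context, int request, int *value);
extern void Cblacs_gridinit (int *context, char *order, int np_row, int np_col);
extern void Cblacs_gridinfo (int context, int *np_row, int *np_col, int *my_row, int *my_col);
extern void Cblacs_barrier (int context, char *scope);
extern void Cblacs_gridexit (int context);

int main (int argc, char *argv[])
{
  int myid, numproc, ctxt;
  int nprow = 2, npcol = 2;      /* assumed 2x2 process grid */
  int myrow, mycol, r, c;

  MPI_Init (&argc, &argv);
  Cblacs_pinfo (&myid, &numproc);
  Cblacs_get (0, 0, &ctxt);                     /* default system context */
  Cblacs_gridinit (&ctxt, "Row", nprow, npcol); /* row-major process grid */
  Cblacs_gridinfo (ctxt, &nprow, &npcol, &myrow, &mycol);

  if (myid == 0) printf ("Processes grid pattern:\n");
  for (r = 0; r < nprow; ++r) {
    for (c = 0; c < npcol; ++c) {
      /* only the process sitting at grid position (r, c) prints its id */
      if (myrow == r && mycol == c) {
        printf ("%d ", myid);
        fflush (stdout);
      }
      Cblacs_barrier (ctxt, "All");
    }
    if (myid == 0) { printf ("\n"); fflush (stdout); }
  }

  Cblacs_gridexit (ctxt);
  MPI_Finalize ();
  return 0;
}

With MPICH2 this would be launched with something like `mpiexec -f machinefile -n 4 ./grid` (machine file and binary names are placeholders). Even with the barriers, stdout from different machines is merged by mpiexec without a strict ordering guarantee, so the printed pattern is only indicative.
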
  • Unfortunately, I'm not familiar with BLACS. Can you try to distill a minimal example showing your issue? We don't know what `init` does, for example, so the problem might be buried in there. – mort Aug 05 '15 at 14:22
  • @mort I updated and now it's all clear to the reader! – gsamaras Aug 05 '15 at 15:49
  • Can you run a simple MPI ping pong test on the two nodes to verify that MPI is configured correctly? If you are using Intel MPI, there is a set of MPI benchmarks called the [IMB](https://software.intel.com/en-us/articles/intel-mpi-benchmarks) that includes that test as well as several others. Those tests are a good way to make sure that MPI is working correctly on both nodes. – Matt Aug 05 '15 at 16:01
  • @Matt, check my edit. I am using MPICH2. The configuration is OK. – gsamaras Aug 05 '15 at 16:20
  • Thanks, have you tried adding in the portion from the example that prints out the grid info, using an `if(myrow == r && mycol == c)` clause and `cout << myid << " " << flush;`, to see what the grid info is? With Intel MPI you can set the `I_MPI_DEBUG=5` environment variable to get lots more debug info when an MPI program is launched. It looks like MPICH uses `MPIEXEC_DEBUG`, but I am having trouble finding more info about it. – Matt Aug 05 '15 at 16:31
  • @Matt really appreciating that you are trying to help me. If we solve this, I will give you upvotes everywhere :P Check my edit, it won't show anything! As for `MPIEXEC_DEBUG`, I have no clue and google didn't help! – gsamaras Aug 05 '15 at 16:42
  • After taking more of a look at the example you were basing this off of, there is a step where `MPI_Bcast(...)` is used to broadcast the dimensions of the matrix and other input to the nodes before `Cblacs_barrier(...)` is used. Excluding the broadcast on a single node would not be an issue, but on multiple nodes it is possible that there could be issues. I will work on testing some variations of the example on my nodes at home, but I can only use tcp since I do not have a high-speed interconnect at home. – Matt Aug 05 '15 at 17:14
  • @Matt adding the broadcast step results in `ncols` being printed by the two processes on the master node (from which I can check the output). But why? I mean, all the variables that are broadcast already have a value on the line of their declaration!!! If you have an update from your tests, please let me know. http://www.quickmeme.com/img/a2/a20a0adec6f4b2d9822846381f9aa637b8429d3a0464d0a6b7d84d81accb32b5.jpg – gsamaras Aug 05 '15 at 22:45
  • I realized that I don't have access to the MKL on my home servers, so I am working on getting both configured with appropriate packages. I should be able to get access to a small cluster with Intel tools sometime tomorrow that I can test on. – Matt Aug 06 '15 at 00:43
  • @Matt thanks for keeping up with the post! I got the student version of Intel® Parallel Studio XE here: https://software.intel.com/en-us/academic/swdevtools – gsamaras Aug 06 '15 at 00:46
  • @Matt solved. :D :D I was on a Greek island for a week and I was debugging... correct code!!! Check my answer. Thanks for your help, which kept my thinking moving until I found the solution. I will upvote two answers of yours I like, as a way of saying thanks! – gsamaras Aug 07 '15 at 16:07
  • 1
    I am glad you were able to work out the issue, cluster configuration and management involve a lot of moving parts, if you plan on doing a lot of work in a cluster environment I would suggest using some sort of cluster management and provisioning tool like [warewulf](http://warewulf.lbl.gov/trac) or [ROCKS](http://www.rocksclusters.org/wordpress/), they can make many aspects of cluster management a lot easier. There are also a variety of reference designs available that use both tools and go through the entire process of building a cluster with known good configuration. – Matt Aug 07 '15 at 16:15

1 Answer


The executables were not the same on both nodes!


When I configured the cluster, I had a very hard time, so when I ran into a problem with mounting, I just skipped it. As a result, my changes would show up on only one node, and the code would behave weirdly unless some (or all) of it was the same on both nodes.

  • Thanks for the up-votes. If you have any questions about setting up and maintaining a cluster, the [Server Fault](http://serverfault.com) community is very knowledgeable on many of the related topics. If you think it would be helpful, I can write a basics-of-setting-up-and-maintaining-a-cluster guide and post it somewhere. – Matt Aug 10 '15 at 16:43
  • Well @Matt, I did up-vote the things I liked! That's very kind of you, but I only needed this for a uni project, which is almost done, so I won't be able to test your article (I won't have access to the cluster), but if you want, I can read it. Thanks again! – gsamaras Aug 17 '15 at 10:24