
I'm trying to find some memory errors in a program of mine using Electric Fence. My program uses Open MPI, and when I try to run it, it segfaults with the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
2001    ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: No such file or directory.
__memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
(gdb) bt
#0  __memcpy_ssse3_back ()
    at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
#1  0x00007ffff72d6b7f in ompi_ddt_copy_content_same_ddt ()
   from /usr/lib/libmpi.so.0
#2  0x00007ffff72d4d0d in ompi_ddt_sndrcv () from /usr/lib/libmpi.so.0
#3  0x00007ffff72dd5b3 in PMPI_Allgather () from /usr/lib/libmpi.so.0
#4  0x00000000004394f1 in ppl::gvec<unsigned int>::gvec (this=0x7fffffffdd60, 
    length=1) at qppl/gvec.h:32
#5  0x0000000000434a35 in TreeBuilder::TreeBuilder (this=0x7fffffffdc60, 
    octree=..., mygbodytab=..., mylbodytab=..., cellpool=0x7fffef705fc8, 
    leafpool=0x7fffef707fc8, bodypool=0x7fffef6bdfc0) at treebuild.cxx:93
#6  0x000000000042fb6b in BarnesHut::BuildOctree (this=0x7fffffffde50)
    at barnes.cxx:155
#7  0x000000000042af52 in BarnesHut::Run (this=0x7fffffffde50)
    at barnes.cxx:386
#8  0x000000000042b164 in main (argc=1, argv=0x7fffffffe118) at barnes.cxx:435

The relevant portion of my code is:

   me = spr_locale_id();
   world_size = spr_num_locales();
   my_elements = std::shared_ptr<T>(new T[1]);

   world_element_pointers = std::shared_ptr<T*>(new T*[world_size]);

   MPI_Allgather(my_elements.get(), sizeof(T*), MPI_BYTE,
       world_element_pointers.get(), sizeof(T*), MPI_BYTE,
       MPI_COMM_WORLD);

I'm not sure why __memcpy_ssse3_back is causing a segfault. This part of the program doesn't segfault when I run without Electric Fence. Does anyone know what's going on? I'm using Open MPI version 1.4.3.

Kurtis Nusbaum
  • Your code makes no sense to me - you send a total of 4 or 8 data bytes (depending on the size of a pointer on your architecture) from the local `my_elements` array and gather all these bytes into an array of pointers?! Never mind, it would help if you could provide at least the version of the Open MPI library being used. – Hristo Iliev Jan 17 '13 at 11:28
  • @HristoIliev I've updated my question to include the version of Open MPI I'm using. I'm transferring my pointers around because I'm going to be doing RDMAs later, and in the worst case everyone has to know everyone else's starting address. – Kurtis Nusbaum Jan 18 '13 at 18:28
  • Now it makes much more sense. This is a known bug in older Open MPI versions - here is the [ticket](https://svn.open-mpi.org/trac/ompi/ticket/1903). I would recommend that you install a newer version of Open MPI or use the patch in the ticket to patch your 1.4.3 version. – Hristo Iliev Jan 18 '13 at 18:51
  • @HristoIliev you should post that as an answer so that I can mark it as correct and give you credit. – Kurtis Nusbaum Jan 18 '13 at 20:23

1 Answer


There are two possible reasons for the error:

There is a bug in the data copy routines, present in older Open MPI versions, that appears to have been fixed in version 1.4.4. If this is the case, upgrading the Open MPI library to a newer version (or applying the patch from the ticket linked in the comments above to your 1.4.3 installation) should solve the problem.

Another possible reason is that my_elements holds a single element of type T. In the MPI_Allgather call you pass a pointer to that element, but you specify sizeof(T*) as the number of bytes to send. By default, Electric Fence places each new allocation at the end of a memory page and puts an inaccessible memory page immediately after it. If T happens to be smaller than a pointer type (e.g. T is int and you are running on a 64-bit LP64 platform), the send reads past the end of the allocation into the inaccessible page, hence the segfault. Since your intention is actually to send a pointer to the data, you should pass MPI_Allgather the address of the value returned by my_elements.get() instead, as sketched below.
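
A minimal sketch of what that corrected call could look like, reusing the variable names from the question (an illustration of the suggested fix, not tested code):

    // Take the address of the local pointer so that the pointer value
    // itself (a T*) is what gets gathered from every rank.
    T* local_ptr = my_elements.get();

    MPI_Allgather(&local_ptr, sizeof(T*), MPI_BYTE,
        world_element_pointers.get(), sizeof(T*), MPI_BYTE,
        MPI_COMM_WORLD);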

By the way, passing raw pointers around is not a nice thing to do. MPI provides its own portable RDMA mechanism; see the One-sided Communications chapter of the MPI standard. It is a bit cumbersome, but it should at least be portable.
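
For reference, here is a rough, self-contained sketch of MPI-2 one-sided communication (a window plus an MPI_Get), just to show the general shape of the API; the neighbour-exchange pattern and the variable names are purely illustrative:

    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank exposes one int through a window instead of
        // publishing its raw address to everyone.
        int local_value = rank;
        MPI_Win win;
        MPI_Win_create(&local_value, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        // Read the value exposed by the next rank.
        int remote_value = -1;
        int target = (rank + 1) % size;

        MPI_Win_fence(0, win);
        MPI_Get(&remote_value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }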

Hristo Iliev
  • The issue was I was sending a T* when I needed to be sending a T**. And I would totally use MPI's RDMA if my purpose wasn't to use a new experimental type of RDMA :) Thanks for all your help! – Kurtis Nusbaum Jan 22 '13 at 19:46