
While debugging my program on a large number of cores, I ran into a very strange insufficient-virtual-memory error. My investigation led me to the piece of code where the master sends small messages to each slave. I then wrote a small test program in which a single master sends 10 integers to every slave with MPI_SEND and all slaves receive them with MPI_RECV. Comparing /proc/self/status before and after MPI_SEND showed that the difference in memory size is huge! The most interesting thing (which crashes my program) is that this memory is not deallocated after MPI_Send and still takes up a huge amount of space.
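
A minimal sketch of the kind of test program described above (illustrative only; the actual code may differ, and the /proc/self/status dump is only indicated in comments):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int payload[10] = {0};      /* the 10 integers mentioned above */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* dump /proc/self/status here ("before MPI_Send") */

    if (rank == 0) {
        /* master: one small message to every slave */
        for (i = 1; i < size; i++)
            MPI_Send(payload, 10, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        /* slave: receive the 10 integers from the master */
        MPI_Recv(payload, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* dump /proc/self/status again here ("after MPI_Send") */

    MPI_Finalize();
    return 0;
}
```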

Any ideas?

 System memory usage before MPI_Send, rank: 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   251400 kB                                                                                 
VmSize:   186628 kB                                                                                 
VmLck:        72 kB                                                                                  
VmHWM:      4068 kB                                                                                  
VmRSS:      4068 kB                                                                                  
VmData:    71076 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       148 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3                                                                                          

 System memory usage after MPI_Send, rank 0
Name:   test_send_size                                                                                
State:  R (running)                                                                                  
Pid:    7825                                                                                           
Groups: 2840                                                                                        
VmPeak:   456880 kB                                                                                 
VmSize:   456872 kB                                                                                 
VmLck:    257884 kB                                                                                  
VmHWM:    274612 kB                                                                                  
VmRSS:    274612 kB                                                                                  
VmData:   341320 kB                                                                                 
VmStk:        92 kB                                                                                  
VmExe:       604 kB                                                                                  
VmLib:      6588 kB                                                                                  
VmPTE:       676 kB                                                                                  
VmSwap:        0 kB                                                                                 
Threads:    3        
vovo
  • Which MPI implementation are you using, and what kind of network is this on? – Greg Inozemtsev Oct 26 '12 at 14:44
  • It's Friday and most regular SO visitors have long since depleted their mana and are unable to guess things that you do not provide in your question. From the amount of registered memory I would guess that you are running over InfiniBand or another RDMA-enabled network. From the size of the data segment and the amount of registered memory I would also guess that you are not reusing the same buffer for all send operations but rather constantly allocating a new one. Please, prove me wrong by telling us what MPI library you use and showing us your sender's source code. – Hristo Iliev Oct 26 '12 at 15:08
  • MPI: impi-4.0.3 COMPILER: intel-13.0 – vovo Oct 26 '12 at 15:08
  • QDR Infiniband 4x/10G Ethernet/Gigabit Ethernet – vovo Oct 26 '12 at 15:10

2 Answers


This is expected behaviour from almost any MPI implementation that runs over InfiniBand. The IB RDMA mechanisms require that data buffers be registered, i.e. they are first locked into a fixed position in physical memory and then the driver tells the InfiniBand HCA how to map virtual addresses to physical memory. Registering memory for use by the IB HCA is a very complex and hence very slow process, which is why most MPI implementations never unregister memory that was once registered, in the hope that the same memory will later be used as a source or destination again. If the registered memory was heap memory, it is never returned to the operating system, and that is why your data segment only grows in size.
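
To make "registration" concrete, here is a rough sketch of what an MPI library does internally via the verbs API when it registers a buffer (illustrative only, assumes libibverbs is available; you never write this yourself when using MPI):

```c
/* compile with: gcc reg_demo.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    size_t len = 1 << 20;                /* a 1 MiB heap buffer */
    void *buf = malloc(len);
    if (!pd || !buf) { fprintf(stderr, "setup failed\n"); return 1; }

    /* Registration pins the buffer's pages in physical RAM and gives the HCA
     * the virtual-to-physical mapping; the returned keys identify the region. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { fprintf(stderr, "registration failed\n"); return 1; }
    printf("registered %zu bytes, lkey=0x%x\n", len, (unsigned)mr->lkey);

    /* An MPI library would typically *keep* this registration around
     * (a registration cache), which is why the memory never shrinks back. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```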

Reuse send and receive buffers as much as possible. Keep in mind that communication over InfiniBand incurs a high memory overhead. Most people don't really think about this, and it is usually poorly documented, but InfiniBand uses a lot of special data structures (queues) that are allocated in the memory of the process, and those queues grow significantly with the number of processes. In some fully connected cases the amount of queue memory can be so large that no memory is actually left for the application.
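
A sketch of what "reuse buffers" means in practice (the helper fill_message and the sizes are made up for the example; as the other answer explains, reuse mainly pays off for messages above the eager threshold, where the user buffer itself gets registered):

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper, just to have something to send. */
static void fill_message(double *buf, int count) {
    for (int i = 0; i < count; i++) buf[i] = (double)i;
}

/* Wasteful: a fresh buffer per send means a fresh registration each time
 * (and, with a registration cache, memory that is never given back). */
void send_many_wasteful(int dest, int count, int nsteps) {
    for (int step = 0; step < nsteps; step++) {
        double *msg = malloc(count * sizeof *msg);
        fill_message(msg, count);
        MPI_Send(msg, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        free(msg);   /* freed to the allocator, but the pages stay registered */
    }
}

/* Better: allocate once and reuse, so at most one registration is created. */
void send_many_reusing(int dest, int count, int nsteps) {
    double *msg = malloc(count * sizeof *msg);
    for (int step = 0; step < nsteps; step++) {
        fill_message(msg, count);
        MPI_Send(msg, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }
    free(msg);
}
```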

There are certain parameters that control the IB queues used by Intel MPI. The most important in your case is I_MPI_DAPL_BUFFER_NUM, which controls the amount of preallocated and preregistered memory. Its default value is 16, so you might want to decrease it. Be aware of possible performance implications, though. You can also try dynamic preallocated buffer sizes by setting I_MPI_DAPL_BUFFER_ENLARGEMENT to 1. With this option enabled, Intel MPI initially registers small buffers and grows them later if needed. Note also that IMPI opens connections lazily, which is why you see the huge increase in used memory only after the call to MPI_Send.

If you are not using the DAPL transport, e.g. you are using the ofa transport instead, there is not much that you can do. You can enable XRC queues by setting I_MPI_OFA_USE_XRC to 1; this should somewhat decrease the memory used. Enabling dynamic queue pair creation by setting I_MPI_OFA_DYNAMIC_QPS to 1 might also reduce memory usage if the communication graph of your program is not fully connected (a fully connected program is one in which each rank talks to every other rank).

Hristo Iliev
  • Thank you both for the complete answers! I am really surprised: I could understand a memory increase of 10-20 MB (1 master, 400 slaves, message size = 10 integers), but 300 MB! And my problem has to use thousands of cores... So, to sum up: my big model starts on 1000 cores and the master distributes some small parameters during the initialization phase. Two things are still not clear to me: (1) Why is the memory increase so huge when the messages are so small? (2) Does this allocation mean that the master process now holds a very large buffer for internal MPI purposes until the end of the program? – vovo Oct 26 '12 at 16:07
  • The huge buffers are mostly to win in benchmarks :) People look at the latency numbers and forget about memory usage. Basically, the vendor runs a benchmark and finds that it is faster to copy a message less than X bytes than to do the memory registration. So that becomes the eager buffer size. I guess this issue is now acknowledged: there is the dynamic resize feature that Hristo pointed out. The internal buffers will stay registered until the end of the program. What's worse, they are pinned in physical memory, and can't even be swapped out (hence your error about virtual memory). – Greg Inozemtsev Oct 26 '12 at 16:15
  • (3) What do you mean by 'reuse buffers'? I hope MPI doesn't allocate new memory for every new message array that I pass to MPI_SEND, am I right? – vovo Oct 26 '12 at 16:18
  • @user1671232 (3) No, the idea is you reuse buffers in your code. If your messages are larger than the eager threshold, then each time you access a new buffer it is registered first (which is slow). Using the same buffer more than once is much faster. But it won't help with small messages. – Greg Inozemtsev Oct 26 '12 at 16:20
  • So, for example, I have two MPI_SEND calls: to the first I pass Array1(1), to the second Array2(1000). Does IMPI allocate two different memory buffers for these arrays? – vovo Oct 26 '12 at 16:29
  • @user1671232, by default the DAPL provider for IMPI preallocates 16 buffers of 16640 bytes each (260 KiB in total) _per connection_. With 400 slaves this results in 102 MiB. The documentation doesn't make it clear if these are used both as send and receive buffers or one has to multiply by two. The latter gives 204 MiB, quite close to 252 MiB in your case. – Hristo Iliev Oct 26 '12 at 16:33
  • @user1671232 No, there are always `I_MPI_DAPL_BUFFER_NUM` buffers per connection. If you run out of them (say there are 16 buffers and you send 17 messages to the same rank in quick succession), MPI will wait for one of the internal buffers to free up. It may also aggregate the messages into a single one. – Greg Inozemtsev Oct 26 '12 at 16:35
  • I have now tried both options (`I_MPI_OFA_DYNAMIC_QPS` and `I_MPI_DAPL_BUFFER_ENLARGEMENT`), but nothing changed. How can I find out which transport is used on my system? Is there some command or environment variable? – vovo Oct 26 '12 at 16:48
  • You can try to enforce a specific fabric by setting `I_MPI_FABRICS`. You could specify `shm:ofa` in order to use shared memory and OpenFabrics verbs or `shm:dapl` for shared memory and DAPL. With DAPL you can try the UD mode by setting `I_MPI_DAPL_UD` to `enable`. – Hristo Iliev Oct 26 '12 at 19:44
  • Alright, overnight I came up with three last questions for complete understanding. (1) I tried message sizes of 10, 1000 and 1000000, and the result was approximately the same. Does that mean that the master process always allocates the predefined 16 buffers per connection and uses only them? (2) At initialization the master `MPI_SEND`s the data; at the finalization phase it will `MPI_RECV` it back from the same processes. Will the master reuse the buffers or allocate more? (3) Only the option `I_MPI_DAPL_UD=enable` brought super results - the memory stays approximately the same. Where is the magic? )) – vovo Oct 27 '12 at 06:54
  • UD is the Unreliable Datagram InfiniBand protocol. It is roughly equivalent to UDP. Don't let the name fool you - it is quite reliable in most cases and the library makes retransmissions if necessary. As for the different message sizes, when the size goes above a certain threshold the user buffer gets registered directly and the data is not copied to the preallocated buffers (ok, part of it is, but I don't want to go into much detail). But still, those buffers are precreated for each connection, no matter if you are going to send small messages or not. – Hristo Iliev Oct 27 '12 at 09:38
  • So the UD protocol preallocates nothing and uses my buffers directly? I ask because I measured the run time and it is the same. – vovo Oct 27 '12 at 11:14
  • @vovo No, even with UD there will be preallocated buffers. And Hristo is right: there really is no way to get rid of them :) As for your application crashes, how much memory do your nodes have? And what's the setting for `memlock` in `/etc/security/limits.conf`? – Greg Inozemtsev Oct 27 '12 at 15:27
  • On the contrary, now everything is working fine: I run my job on 5000 cores and only a small amount of memory is added after `MPI_SEND`. And the run time is the same as before the modifications, so the new protocol (`UD`) doesn't affect performance. And my question is - where is the magic? )) – vovo Oct 27 '12 at 15:38
  • @vovo I updated my answer with an explanation for this behavior. – Greg Inozemtsev Oct 28 '12 at 18:12
  • @vovo, as an Intel MPI user you may find this presentation, which Michael Klemm from Intel gave recently at one of our tuning workshops, very handy - [23 tips for performance tuning with the Intel® MPI Library](https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/03%20MPI%20Tuning.pdf). – Hristo Iliev Oct 28 '12 at 18:34

Hristo's answer is mostly right, but since you are using small messages there's a bit of a difference. The messages end up on the eager path: they first get copied to an already-registered buffer, then that buffer is used for the transfer, and the receiver copies the message out of an eager buffer on their end. Reusing buffers in your code will only help with large messages.

This is done precisely to avoid the slowness of registering the user-supplied buffer. For large messages the copy takes longer than the registration would, so the rendezvous protocol is used instead.

These eager buffers are somewhat wasteful. For example, they are 16kB by default on Intel MPI with OF verbs. Unless message aggregation is used, each 10-int-sized message is eating four 4kB pages. But aggregation won't help when talking to multiple receivers anyway.

So what to do? Reduce the size of the eager buffers. This is controlled by setting the eager/rendezvous threshold (I_MPI_RDMA_EAGER_THRESHOLD environment variable). Try 2048 or even smaller. Note that this can result in a latency increase. Or change the I_MPI_DAPL_BUFFER_NUM variable to control the number of these buffers, or try the dynamic resizing feature that Hristo suggested. This assumes your IMPI is using DAPL (the default). If you are using OF verbs directly, the DAPL variables won't work.


Edit: So the final solution for getting this to run was setting I_MPI_DAPL_UD=enable. I can speculate on the origin of the magic, but I don't have access to Intel's code to actually confirm this.

IB can have different transport modes, two of which are RC (Reliable Connected) and UD (Unreliable Datagram). RC requires an explicit connection between hosts (like TCP), and some memory is spent per connection. More importantly, each connection has those eager buffers tied to it, and this really adds up. This is what you get with Intel's default settings.

There is an optimization possible: sharing the eager buffers between connections (this is called SRQ - Shared Receive Queue). There's a further Mellanox-only extension called XRC (eXtended RC) that takes the queue sharing further: between the processes that are on the same node. By default Intel's MPI accesses the IB device through DAPL, and not directly through OF verbs. My guess is this precludes these optimizations (I don't have experience with DAPL). It is possible to enable XRC support by setting I_MPI_FABRICS=shm:ofa and I_MPI_OFA_USE_XRC=1 (making Intel MPI use the OFA interface instead of DAPL).

When you switch to the UD transport you get a further optimization on top of buffer sharing: there is no longer a need to track connections. The buffer sharing is natural in this model: since there are no connections, all the internal buffers are in a shared pool, just like with SRQ. So there are further memory savings, but at a cost: datagram delivery can potentially fail, and it is up to the software, not the IB hardware to handle retransmissions. This is all transparent to the application code using MPI, of course.

Greg Inozemtsev