
I have a NUMA machine with two nodes; each node has 8 cores and 64 GB of memory. I run two services, service-1 and service-2: service-1 is deployed on node 0 and service-2 on node 1.

I just want these two services to run separately, so I start them as follows:

numactl --membind=0 ./service-1

numactl --membind=1 ./service-2

In the service code I use pthread_setaffinity_np to bind threads to the corresponding node's CPUs (service-1's threads are bound to CPUs 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30).

localhost:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31


localhost:~$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 64327 MB
node 0 free: 17231 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 64498 MB
node 1 free: 37633 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

What I expected was that Linux would allocate service-2's memory on node 1, so that service-2 mostly accesses node-1 memory plus a few shared-library pages on node 0. But sadly I find that service-2 has a lot of memory on the remote node (node 0), and accessing it costs too much time.

service-2's numa_maps looks as follows:

7f07e8c00000 bind:1 anon=455168 dirty=455168 N0=442368 N1=12800
7f08e4200000 bind:1 anon=128000 dirty=128000 N0=121344 N1=6656
7f092be00000 bind:1 anon=21504 dirty=21504 N0=20992 N1=512
7f093ac00000 bind:1 anon=32768 dirty=32768 N0=27136 N1=5632
7f0944e00000 bind:1 anon=82432 dirty=82432 N0=81920 N1=512
7f0959400000 bind:1 anon=1024 dirty=1024 N0=1024
7f0959a00000 bind:1 anon=4096 dirty=4096 N0=4096
7f095aa00000 bind:1 anon=2560 dirty=2560 N0=2560
7f095b600000 bind:1 anon=4608 dirty=4608 N0=4608
7f095c800000 bind:1 anon=512 dirty=512 N0=512
7f095cc00000 bind:1 anon=512 dirty=512 N0=512
...

So here are my questions:

1、Does Linux really allocate remote memory (node 0) for service-2, even though service-2 is already bound to node 1 by the membind command?

--membind=nodes, -m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.

2、Is this related to kernel.numa_balancing, which is 1 on my machine? In my understanding, with kernel.numa_balancing = 1, Linux only migrates a task closer to its memory, or moves memory to the node where the task is executing. Since service-2 is already bound to node 1, no balancing should happen?

3、Can someone explain how this remote allocation happened? Is there any way to avoid it?

Thank you very much!

  • The problem could be the node where service-2 is running at startup. You could bind service-2 to node 1 from startup: `numactl --cpunodebind=1 ./service-2`. – Oliv Jul 29 '19 at 08:12
  • Thank you Oliv! That's quite possible. But `taskset -c -p ${pid}` shows that service-2 is always running on node-1 CPUs. – yskyj Jul 29 '19 at 09:10

0 Answers