
I've been trying to figure out the issue with a failing numactl command, but it looks like maybe I don't fully understand the way numactl or OMP_MP_THREAD works.

I'm trying to run one instance of a script main.py bound to 4 CPUs of numa-node-1 using `numactl --physcpubind=24-27 --membind=1 python -u main.py`, since lscpu shows CPUs 24-27 belonging to numa-node-1.

But I get the following error.

libnuma: Warning: node argument 1 is out of range
<1> is invalid

If I use `--membind=3` I get the same error, but it runs when I use `--membind=2`.

Questions:

1. For numa-node-0, is each of CPUs 0-23 in `0-23,96-119` a physical core, or are only some of 0-23 physical cores, given that there are 2 threads per core? How do I know which of `0-23,96-119` are physical cores and which are the second hyperthreads?

2. Am I binding the physical cores to the nodes correctly? Why does the above command fail?

3. Which 2 NUMA nodes are on socket-0 and which ones are on socket-1?

Outputs:

lscpu:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-191
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping:                        7
Frequency boost:                 enabled
CPU MHz:                         1000.026
CPU max MHz:                     2301,0000
CPU min MHz:                     1000,0000
BogoMIPS:                        4600.00
L1d cache:                       3 MiB
L1i cache:                       3 MiB
L2 cache:                        96 MiB
L3 cache:                        143 MiB
NUMA node0 CPU(s):               0-23,96-119
NUMA node1 CPU(s):               24-47,120-143
NUMA node2 CPU(s):               48-71,144-167
NUMA node3 CPU(s):               72-95,168-191

numactl --hardware:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 64106 MB
node 0 free: 28478 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 2 size: 64478 MB
node 2 free: 45446 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 
Joe Black
  • A minor detail which doesn't affect the answer, but OpenMP does not have an OMP_MP_THREAD environment variable. Maybe you mean OMP_NUM_THREADS? – Jim Cownie May 25 '21 at 08:13
  • Yes, I meant OMP_NUM_THREADS. Does OMP_NUM_THREADS=4 mean that only 4 threads are created for all the CPUs bound in that `numactl` command? – Joe Black May 25 '21 at 17:00
  • What if the CPU has hyperthreading (like in this case, where every physical CPU has 2 threads)? Does that 4 include the hyperthreads or not? – Joe Black May 25 '21 at 17:00
  • Any idea what distances of `10` and `21` mean in the output of `numactl --hardware`? – Joe Black May 25 '21 at 17:14
  • OMP_NUM_THREADS is utterly unconcerned with any hardware properties at all. It tells the OpenMP runtime system how many software threads to create, and that's it. If you ask for 42 threads (or 420) that is what it will try to create even if you're running on a single core, no SMT machine. – Jim Cownie May 26 '21 at 10:11
  • So if one wants to fully utilize the cores in this case, one should run with OMP_NUM_THREADS=2*physical_cores_used? – Joe Black May 26 '21 at 16:06
  • E.g. if I want to use 4 cores on node-0, then I should specify `--physcpubind=0-3,96-99` and NOT just `--physcpubind=0-3`, true? – Joe Black May 26 '21 at 16:08
  • With this, the command would be `--physcpubind=0-3,96-99 --membind=0` with OMP_NUM_THREADS=8 (not 4, because to fully use the 4 physical cores one needs to allow 8 threads)? – Joe Black May 26 '21 at 16:08
  • The best thing to do for OpenMP is... nothing. Don't set OMP_NUM_THREADS at all. All sane OpenMP runtimes create one thread per available logical CPU by default, where "available" is controlled by external mechanisms (ultimately the sched_{get,set}affinity mask, but at higher levels cpuset, numactl etc., which work on top of sched_{set,get}affinity). Setting OMP_NUM_THREADS yourself just gives you an opportunity to get it wrong. – Jim Cownie May 28 '21 at 09:27

1 Answer


The issue here is that some of your NUMA nodes aren't populated with any memory. You can see that in the output of the `numactl --hardware` command, which shows a size of 0 MB for the memory on nodes 1 and 3. Therefore, trying to bind memory to these nodes is a lost battle...
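
As a quick check (a minimal sketch; the CPU range 48-51 below is simply the first four logical CPUs of node 2 taken from your own output), you can list the per-node memory sizes and then bind both the CPUs and the memory to a node that actually reports a non-zero size:

# Show how much memory is attached to each NUMA node;
# nodes reporting "size: 0 MB" will be rejected by --membind.
numactl --hardware | grep size

# Bind to 4 logical CPUs of node 2 and to node 2's local memory
# (node 2 is one of the two populated nodes on this machine).
numactl --physcpubind=48-51 --membind=2 python -u main.py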

Just a side note: 9242 CPUs are normally (well, AFAIK) only available with welded-on memory modules, so it is very unlikely that there are missing memory DIMMs on your machine. So either there's something very wrong at the hardware level of your machine, or there's a layer of virtualization of some sort which hides part of the memory from you. Either way, the configuration is very wrong and needs to be investigated further.

EDIT: Answering the extra questions

  1. Physical core vs. HW thread numbering: when hyperthreading is enabled, there is no separate numbering of the physical cores anymore; all CPUs seen by the OS are actually HW threads. Put simply, in your case, physical core 0 is seen as the 2 logical cores 0 and 96, physical core 1 is seen as logical cores 1 and 97, and so on (see the topology check after this list).

  2. Numactl failure: already answered

  3. NUMA node numbering: generally speaking, it depends on the BIOS of the machine. There are 2 main options for numbering when you have N physical sockets in a machine, with P cores each. These 2 options are the following (the naming is mine, I'm not sure there's an official one):

    1. Spreading:
      • Socket 0: cores 0, N, 2N, 3N, ..., (P-1)N
      • Socket 1: cores 1, N+1, 2N+1, ..., (P-1)N+1
      • ...
      • Socket N-1: cores N-1, 2N-1, ..., PN-1
    2. Linear:
      • Socket 0: cores 0, 1, ..., P-1
      • Socket 1: cores P, P+1, ..., 2P-1
      • ...
      • Socket N-1: cores (N-1)P, ..., NP-1

    And if hyperthreading is activated, you just add P more CPU numbers per socket, numbered so that CPUs C and C+PN are actually the 2 HW threads of the same physical core.

    In your case here, you are seeing linear numbering: cores 0-47 are on socket 0 and cores 48-95 on socket 1, so NUMA nodes 0 and 1 sit on socket-0 and NUMA nodes 2 and 3 on socket-1 (the lscpu -p check after this list lets you verify that).

  4. numactl --physcpubind=0-3: this restricts the set of logical cores the command you launched is allowed to be scheduled on to the list passed as a parameter, namely cores 0, 1, 2 and 3. But that doesn't force the code you launched to use more than one core at a time. For OpenMP codes, you still need to set the OMP_NUM_THREADS environment variable for that (see the combined example after this list).

  5. OMP_NUM_THREADS and HT: OMP_NUM_THREADS only tells how many threads to launch; it doesn't care about cores, be they physical or logical.

  6. Distance reported by numactl: I'm not too sure of the exact meaning / accuracy of the values reported, but here is how I interpret them when I need to: to me they correspond to relative memory access latencies. I don't know whether the values are measured or just guessed, or whether they are cycles or nanoseconds, but here is what it says:

    • Cores from NUMA node 0 have an access latency to memory attached to NUMA node 0 of 10 and of 21 to all other NUMA nodes
    • Cores from NUMA node 1 have an access latency to memory attached to NUMA node 1 of 10 and of 21 to all other NUMA nodes
    • etc
      But the crucial point is that accessing distant memory takes 2.1 times longer than accessing local memory.
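
To check the physical-core/HW-thread pairing and the core-to-socket/NUMA-node mapping yourself (a small sketch using standard Linux tools only, nothing specific to numactl), you can run:

# One line per logical CPU, showing which core, socket and NUMA node it belongs to.
lscpu -p=cpu,core,socket,node

# The sibling HW threads of logical CPU 0; on this machine it should print
# something like "0,96" if the pairing described in point 1 is correct.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list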
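
As a combined example (a sketch only; whether you want 8 threads, 4 threads or no explicit setting at all is exactly what the comments discuss), running an OpenMP-based code on the 4 physical cores 0-3 of node 0 together with their sibling HW threads could look like this:

# Allow the 4 physical cores of node 0 plus their sibling HW threads, keep the
# memory on node 0, and create one software thread per allowed logical CPU.
OMP_NUM_THREADS=8 numactl --physcpubind=0-3,96-99 --membind=0 python -u main.py

Alternatively, simply leave OMP_NUM_THREADS unset: as pointed out in the comments, a sane OpenMP runtime will by default create one thread per logical CPU that numactl leaves available.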
Gilles
  • So when I run `numactl --physcpubind=0-3 --membind=0 python -u main.py`, how many physical CPUs is it using? 4, or 2 physical and 4 logical? How does one know which of `NUMA node0 CPU(s): 0-23,96-119` are physical and which are logical cores due to hyperthreading? – Joe Black May 25 '21 at 17:12
  • Does this command also need `OMP_NUM_THREADS=4` to make sure that only 4 CPUs are used, or is it enough by itself? The same question regarding HT (hyperthreading) applies to `OMP_NUM_THREADS=4` -- it's not clear whether it means 4 with HT or just physical cores. – Joe Black May 25 '21 at 17:13
  • Any idea what distances of `10` and `21` mean in the output of `numactl --hardware`? ```node 0 1 2 3 0: 10 21 21 21 1: 21 10 21 21 2: 21 21 10 21 3: 21 21 21 10 ``` – Joe Black May 25 '21 at 17:14
  • @JoeBlack I hope I answered all of your questions – Gilles May 26 '21 at 07:01
  • So if "OMP_NUM_THREADS only tells how many threads to launch", then in this case with HT on, to fully utilize the cores should one run with OMP_NUM_THREADS=2*physical_cores_to_use? So to launch 1 instance of the code on 4 physical cores, the command would be `--physcpubind=0-3,96-99 --membind=0` with OMP_NUM_THREADS=8 (not 4, because to fully use 4 physical cores one needs to allow 8 threads)? – Joe Black May 26 '21 at 16:13
  • also, if want to use 4 cores on node-0, then should specify `--physcpubind=0-3,96-99` and NOT just `--physcpubind=0-3`, true? – Joe Black May 26 '21 at 16:13
  • I thought OpenMP code/variables only apply when one is using the MPI APIs/libraries, and some systems may not be using them and instead write _explicit_ MPI threads in the code using MPI libs. Or do all runtime systems automatically do thread control using the OpenMP system/standard? – Joe Black May 26 '21 at 16:17