
I'm not sure how to use the physical IDs that I've retrieved from hwloc. Given the topology below, if I place `rank 0=10.0.2.4 slot=1:8` in the rankfile, I get an error that 8 does not exist. `rank 0=10.0.2.4 slot=1:p8` runs with no problem, but I don't know whether it references PU P#8 or Core P#8. How do I bind to either a specific core or a specific hardware thread using the rankfile? Is there some way to debug this?

[hamiltont@4 latency]$ hwloc-ls -p
Machine (36GB)
  NUMANode P#0 (18GB) + Socket P#1 + L3 (12MB)
    L2 (256KB) + L1 (32KB) + Core P#0
      PU P#0
      PU P#12
    L2 (256KB) + L1 (32KB) + Core P#1
      PU P#2
      PU P#14
    L2 (256KB) + L1 (32KB) + Core P#2
      PU P#4
      PU P#16
    L2 (256KB) + L1 (32KB) + Core P#8
      PU P#6
      PU P#18
    L2 (256KB) + L1 (32KB) + Core P#9
      PU P#8
      PU P#20
    L2 (256KB) + L1 (32KB) + Core P#10
      PU P#10
      PU P#22
  NUMANode P#1 (18GB) + Socket P#0 + L3 (12MB)
    L2 (256KB) + L1 (32KB) + Core P#0
      PU P#1
      PU P#13
    L2 (256KB) + L1 (32KB) + Core P#1
      PU P#3
      PU P#15
    L2 (256KB) + L1 (32KB) + Core P#2
      PU P#5
      PU P#17
    L2 (256KB) + L1 (32KB) + Core P#8
      PU P#7
      PU P#19
    L2 (256KB) + L1 (32KB) + Core P#9
      PU P#9
      PU P#21
    L2 (256KB) + L1 (32KB) + Core P#10
      PU P#11
      PU P#23

I see this question as pretty close to what I'm asking, but not quite the same.

Hamy
  • By running with `rank 0=10.0.2.4 slot=1:p23` with no errors, I've verified that it's referencing the hardware thread ID and not the core, so I don't know how to bind to a core using the physical ID – Hamy Feb 25 '13 at 19:00
  • One weird thing: both `rank 0=10.0.2.4 slot=1:4` and `rank 0=10.0.2.4 slot=1:5` work, even though there are no cores with that physical ID – Hamy Feb 25 '13 at 19:00

1 Answer


There is a PU with physical ID 23, but no core with that ID. So if a rankfile entry that references 23 fails, MPI must be trying to resolve the number as a core; if the entry runs without error, MPI is resolving it as a PU.

# No errors, so this is referencing a PU
rank 1=10.0.2.4 slot=1:p23
# No errors, so we are referencing a PU
rank 1=10.0.2.4 slot=p1:p23
# Error! We might be referencing a core
rank 1=10.0.2.4 slot=p1:23
# No error, we are probably referencing physical socket 1 and physical core 8
rank 1=10.0.2.4 slot=p1:8
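
To make this kind of trial and error easier to interpret, it can help to check whether a given number exists as a physical core ID, a physical PU ID, or both. The following is a minimal sketch, not part of hwloc or MPI: `parse_topology` and `lookup` are hypothetical helper names, and the input is a shortened excerpt of the `hwloc-ls -p` output from the question. It only inspects the topology text; it does not tell you how the rankfile parser interprets a number.

```python
import re

# Shortened excerpt of the `hwloc-ls -p` output from the question.
HWLOC_OUTPUT = """\
Machine (36GB)
  NUMANode P#0 (18GB) + Socket P#1 + L3 (12MB)
    L2 (256KB) + L1 (32KB) + Core P#8
      PU P#6
      PU P#18
    L2 (256KB) + L1 (32KB) + Core P#9
      PU P#8
      PU P#20
  NUMANode P#1 (18GB) + Socket P#0 + L3 (12MB)
    L2 (256KB) + L1 (32KB) + Core P#10
      PU P#11
      PU P#23
"""

def parse_topology(text):
    """Return {socket_id: {core_id: [pu_ids]}} using physical (P#) indices."""
    topo = {}
    socket = core = None
    for line in text.splitlines():
        m = re.search(r"Socket P#(\d+)", line)
        if m:
            socket = int(m.group(1))
            topo[socket] = {}
            core = None
            continue
        m = re.search(r"Core P#(\d+)", line)
        if m and socket is not None:
            core = int(m.group(1))
            topo[socket][core] = []
            continue
        m = re.search(r"PU P#(\d+)", line)
        if m and core is not None:
            topo[socket][core].append(int(m.group(1)))
    return topo

def lookup(topo, number):
    """List every place `number` appears, as (kind, socket_id, core_id) tuples."""
    hits = []
    for socket_id, cores in topo.items():
        for core_id, pus in cores.items():
            if core_id == number:
                hits.append(("core", socket_id, core_id))
            if number in pus:
                hits.append(("pu", socket_id, core_id))
    return hits

topo = parse_topology(HWLOC_OUTPUT)
for n in (8, 23):
    print(n, lookup(topo, n))
```

In this excerpt, 8 shows up both as a core ID (Core P#8 on socket 1) and as a PU ID (under Core P#9 on socket 1), which is why `slot=1:p8` is ambiguous at first glance, while 23 shows up only as a PU.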
Hamy