
Section 2.13.2 mentions that the arbitration ID is used to determine which processor issues the no-op cycle first, and I have seen this in multiple sources, including the Intel manual. The Intel manual's description of the MP initialisation sequence only addresses the Pentium 4 era, when there was a 'system bus', and before that there was originally an 'APIC bus'. I am under the impression that the arbitration ID was only needed in those architectures where multiple CPUs shared the same bus. Now, with the ring bus architecture, arbitration is done by sensing an empty slot on the ring and placing the transaction in it; the slot moves round at one stop per cycle, meaning arbitration would no longer seem to be required.

What's interesting is that Section 2.13.2 is part of a document that discusses the Intel ME and the PCH, so it is obviously talking about Nehalem and later; but given that it says the APIC arbitration ID is used, perhaps it is indeed only describing Nehalem or Westmere.

So I ask: how is the BSP selected on ring, and indeed mesh, architectures? My thought was that the cores could use cache-as-RAM and, if cache coherency does function in no-fill mode, race for a mutex.

Peter Cordes
Lewis Kelsey

1 Answer


I assume it's just hard-wired that one of the cores is the BSP. I don't think the other cores even power up until you send them an IPI, and they certainly wouldn't be running code that tries to take a mutex in cache to sort this out. The other cores probably come up in a HALT-like state that waits for an interrupt.

(But probably in a deep-sleep C-state like C7 or something, unlike the actual HLT instruction, so that if the OS never wakes up some of the cores, putting the woken cores to sleep can let the whole package go into a deep sleep state.)

For multi-socket systems, presumably one socket is special somehow.

Peter Cordes
  • I suppose it could be hardwired such that core 0 starts executing from the reset vector and all other cores wait in an INIT IPI state. I'm probably just going to have to accept it as what happens because it doesn't really matter I guess. – Lewis Kelsey Mar 25 '19 at 11:09
  • A waiting-for-INIT-IPI state * – Lewis Kelsey Mar 25 '19 at 15:55
  • 1
    I don't think hard-wiring the BSP works. My understanding is that we don't even know which of the physical cores will be working properly until after fabricating the chip (and marketing it as such). Otherwise, the hard-wired core might be the faulty one. So there is a need for a boot-time mechanism to choose one of the (properly working) logical cores as the BSP. I also don't understand the question. There can be many transactions on the ring or mesh at any point in time. I'm not sure why @LewisKelsey thinks that the arbitration ID is not needed. – Hadi Brais Mar 29 '19 at 09:02
  • @HadiBrais: When I said "hard wire", I meant after fusing off non-working cores. It doesn't have to be purely run-time, it can be set in the factory with the same mechanism they use to alter circuits, like a laser or something. Or maybe there's some hardware logic that selects the first non-fused-off core. But there are *lots* of metal layers above the silicon that could be altered based on which cores are working. – Peter Cordes Mar 29 '19 at 09:07
  • Also on multi-socket system, we don't know at design-time which of the sockets contain chips and which don't. The user might install the CPU on any of the sockets and on only one of them (or all of them). – Hadi Brais Mar 29 '19 at 09:09
  • Right, that is one possible way for choosing the BSP, but the algorithm mentioned in the Intel manual Section 8.4 applies to all Intel processors (according to the manual as clearly stated). – Hadi Brais Mar 29 '19 at 09:10
  • @HadiBrais: I wondered about multi-socket but didn't check. Isn't it normal that there's one socket that must be populated, if you're going to under-populate a board? I was guessing that this socket (if my guess is right) got a hard-wired signal to tell it that one of its cores should be the BSP. Otherwise I guess you could have the sockets negotiate with some kind of special protocol over QPI to elect a BSP, and have the others go back to sleep. – Peter Cordes Mar 29 '19 at 09:17
  • 1
    Good point. I looked at the [manual](https://www.supermicro.com/products/motherboard/QPI/5500/X8DTN_.cfm?IPMI=O) of Supermicro X8DTN, which is a dual socket motherboard. The tables for optimal memory population shown on page 30 and 31 indicate that any of the two sockets can be populated or both. So yeah I think there needs to be some negotiation between the sockets (if there is more than one). – Hadi Brais Mar 29 '19 at 09:31
  • @HadiBrais why would you need an arbitration ID on the ring bus? It's my understanding that the ring bus doesn't require arbitration or a request/grant mechanism. The only contention at a stop is between the LLC slice and the core, and they have to negotiate which of them places a transaction on which ring, and in which direction, depending on which slot on which ring in which direction is available and which direction is quicker. – Lewis Kelsey Mar 29 '19 at 16:31
  • 1
    @LewisKelsey Let's very carefully read how Section 8.4.3 of Volume 3 describes the BSP selection process. First, each processor is assigned a unique APIC ID (each logical core has a local APIC). Second, each logical processor is assigned an arbitration priority based on the APIC ID (which could be equal to the APIC ID). Third, each processor executes the built-in self test. Fourth, on modern Intel processors, **each logical processor issues a NOP Special Cycle on the system bus**. What does this mean?... – Hadi Brais Mar 29 '19 at 19:13
  • 1
    ...The Special Cycle is a type of transaction that is handled by the Ubox unit in the system agent as I'll describe shortly. In modern Intel processors, the system bus is basically the QPI interconnect. So putting it together, this means that each logical core issues a Special Cycle request to the Ubox and each such request is tagged with the arbitration priority of the logical core. The Ubox receives all of these requests (from the internal cores of the respective socket). Each Ubox chooses the request with the highest priority and then itself arbitrates for the QPI bus master lock... – Hadi Brais Mar 29 '19 at 19:13
  • Then one of the Uboxes acquires the master lock and broadcasts the selected request to all logical cores in the system (including the core that sent the request). Each core then receives the request and examines the ID associated with it. If it is equal to its ID, then it sets the BSP flag in its IA32_APIC_BASE MSR, indicating that it is the BSP processor, and then fetches and executes the firmware bootstrap code. Otherwise, it resets its BSP flag, indicating that it is an AP processor, and then enters a wait-for-SIPI state... – Hadi Brais Mar 29 '19 at 19:13
  • ...So the arbitration for the BSP happens really at the Ubox. That said, each router on the ring/mesh also includes arbitration logic, but this logic is just for routing and has nothing to do with BSP selection. This is my understanding of Section 8.4.3. – Hadi Brais Mar 29 '19 at 19:14
  • @HadiBrais: those last few comments look like a better answer than my guesses + hand-waving. – Peter Cordes Mar 29 '19 at 20:18
  • @HadiBrais So the arbitration ID is still useful for this process I guess. This was a very interesting answer. I didn't know what the Ubox did but I had seen it on diagrams. – Lewis Kelsey Mar 31 '19 at 11:45
  • https://github.com/RRZE-HPC/likwid/wiki/BroadwellEP one of its functions potentially appears to be to translate MSIs to a LAPIC or ring bus understandable form. Perhaps the Ubox accepts the FEEh range..? What else do you know about it? – Lewis Kelsey Mar 31 '19 at 12:53
  • 1
    @LewisKelsey The Ubox is discussed in the Intel uncore manuals. The LIKWID documentation on the Ubox is actually from the manuals. – Hadi Brais Mar 31 '19 at 17:25