According to the MSDN documentation and Stephen Toub's answer, my C# app should use every logical processor of every processor group, because it is configured to do so (see my App.config below).

I run my app on a Windows Server 2012 R2 machine with a NUMA architecture: 2 x [Xeon E5-2697 v3 CPU, 14 cores each, with Hyper-Threading enabled] => 2 x 14 x 2 = 56 logical processors.

My app starts 80 threads, either with the Thread class or with Parallel.For, and in both cases it only uses 28 logical processors, all from the same processor group.
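
The arithmetic above can be written out as a quick sanity check. This is only a minimal sketch in Python: the function and variable names are hypothetical, and the only inputs are the numbers quoted in the question.

```python
def logical_processors(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Total logical processors the OS is expected to expose."""
    return sockets * cores_per_socket * threads_per_core


# Numbers from the question: 2 sockets, 14 cores each,
# Hyper-Threading on (2 hardware threads per core).
expected = logical_processors(sockets=2, cores_per_socket=14, threads_per_core=2)
observed = 28  # logical processors actually used by the 80 threads

print(expected)             # 56
print(observed / expected)  # 0.5
```

The observed count is exactly one socket's worth (14 x 2), which matches the symptom of being confined to a single processor group.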

Why does the task scheduler assign my threads to only one processor group?

My code is available on GitHub, and the executable can be downloaded from my home website.

I've already asked this question on social.msdn.microsoft.com without any answers.

  • Update 2015-01-26: I reported a bug at connect.microsoft.com

  • Update 2015-01-30: I added CoreInfo dump as additional references.

  • Update 2015-01-30: The problem also occurs with Prime95, which only offers 28 threads to select (so it is not C#-related)

  • Update 2015-01-30: My tool now shows more information, such as the processor mask per node. It looks like I do not have access to the other node (the node my process is not running in)

  • Update 2015-02-02, We do NOT have Citrix installed on this particular server (sorry, I was wrong)

  • Update 2015-02-02, We contacted HP...

  • Update 2015-02-03, Added more information to my program to display the processor group per thread, plus a few more little gadgets.

  • Update 2015-02-17, After talking to the HP dev team, I updated my answer to reflect what I learned. (I just want to mention that I received EXCELLENT support from HP)

  • Update 2015-05-13, HP confirmed the problem in a "Customer Advisory" note. See this linked document id: c04650594

I set my .NET 4.5 (or 4.5.1) App.config to:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <runtime>
        <Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
        <GCCpuGroup enabled="true"></GCCpuGroup>
        <gcServer enabled="true"></gcServer>
    </runtime>
    <startup> 
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5.1"/>
    </startup>
</configuration>

This is the dump of CoreInfo from Microsoft:

Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
Microcode signature: 00000023
HTT         *   Hyperthreading enabled
HYPERVISOR  -   Hypervisor is present
VMX         *   Supports Intel hardware-assisted virtualization
SVM         -   Supports AMD hardware-assisted virtualization
X64         *   Supports 64-bit mode

SMX         *   Supports Intel trusted execution
SKINIT      -   Supports AMD SKINIT

NX          *   Supports no-execute page protection
SMEP        *   Supports Supervisor Mode Execution Prevention
SMAP        -   Supports Supervisor Mode Access Prevention
PAGE1GB     *   Supports 1 GB large pages
PAE         *   Supports > 32-bit physical addresses
PAT         *   Supports Page Attribute Table
PSE         *   Supports 4 MB pages
PSE36       *   Supports > 32-bit address 4 MB pages
PGE         *   Supports global bit in page tables
SS          *   Supports bus snooping for cache operations
VME         *   Supports Virtual-8086 mode
RDWRFSGSBASE    *   Supports direct GS/FS base access

FPU         *   Implements i387 floating point instructions
MMX         *   Supports MMX instruction set
MMXEXT      -   Implements AMD MMX extensions
3DNOW       -   Supports 3DNow! instructions
3DNOWEXT    -   Supports 3DNow! extension instructions
SSE         *   Supports Streaming SIMD Extensions
SSE2        *   Supports Streaming SIMD Extensions 2
SSE3        *   Supports Streaming SIMD Extensions 3
SSSE3       *   Supports Supplemental SIMD Extensions 3
SSE4a       -   Supports Streaming SIMDR Extensions 4a
SSE4.1      *   Supports Streaming SIMD Extensions 4.1
SSE4.2      *   Supports Streaming SIMD Extensions 4.2

AES         *   Supports AES extensions
AVX         *   Supports AVX intruction extensions
FMA         *   Supports FMA extensions using YMM state
MSR         *   Implements RDMSR/WRMSR instructions
MTRR        *   Supports Memory Type Range Registers
XSAVE       *   Supports XSAVE/XRSTOR instructions
OSXSAVE     *   Supports XSETBV/XGETBV instructions
RDRAND      *   Supports RDRAND instruction
RDSEED      -   Supports RDSEED instruction

CMOV        *   Supports CMOVcc instruction
CLFSH       *   Supports CLFLUSH instruction
CX8         *   Supports compare and exchange 8-byte instructions
CX16        *   Supports CMPXCHG16B instruction
BMI1        *   Supports bit manipulation extensions 1
BMI2        *   Supports bit manipulation extensions 2
ADX         -   Supports ADCX/ADOX instructions
DCA         *   Supports prefetch from memory-mapped device
F16C        *   Supports half-precision instruction
FXSR        *   Supports FXSAVE/FXSTOR instructions
FFXSR       -   Supports optimized FXSAVE/FSRSTOR instruction
MONITOR     *   Supports MONITOR and MWAIT instructions
MOVBE       *   Supports MOVBE instruction
ERMSB       *   Supports Enhanced REP MOVSB/STOSB
PCLMULDQ    *   Supports PCLMULDQ instruction
POPCNT      *   Supports POPCNT instruction
LZCNT       *   Supports LZCNT instruction
SEP         *   Supports fast system call instructions
LAHF-SAHF   *   Supports LAHF/SAHF instructions in 64-bit mode
HLE         -   Supports Hardware Lock Elision instructions
RTM         -   Supports Restricted Transactional Memory instructions

DE          *   Supports I/O breakpoints including CR4.DE
DTES64      *   Can write history of 64-bit branch addresses
DS          *   Implements memory-resident debug buffer
DS-CPL      *   Supports Debug Store feature with CPL
PCID        *   Supports PCIDs and settable CR4.PCIDE
INVPCID     *   Supports INVPCID instruction
PDCM        *   Supports Performance Capabilities MSR
RDTSCP      *   Supports RDTSCP instruction
TSC         *   Supports RDTSC instruction
TSC-DEADLINE    *   Local APIC supports one-shot deadline timer
TSC-INVARIANT   *   TSC runs at constant rate
xTPR        *   Supports disabling task priority messages

EIST        *   Supports Enhanced Intel Speedstep
ACPI        *   Implements MSR for power management
TM          *   Implements thermal monitor circuitry
TM2         *   Implements Thermal Monitor 2 control
APIC        *   Implements software-accessible local APIC
x2APIC      *   Supports x2APIC

CNXT-ID     -   L1 data cache mode adaptive or BIOS

MCE         *   Supports Machine Check, INT18 and CR4.MCE
MCA         *   Implements Machine Check Architecture
PBE         *   Supports use of FERR#/PBE# pin

PSN         -   Implements 96-bit processor serial number

PREFETCHW   *   Supports PREFETCHW instruction

Maximum implemented CPUID leaves: 0000000F (Basic), 80000008 (Extended).

Logical to Physical Processor Map:
Physical Processor 0 (Hyperthreaded):
**------------------------------------------------------
Physical Processor 1 (Hyperthreaded):
--**----------------------------------------------------
Physical Processor 2 (Hyperthreaded):
----**--------------------------------------------------
Physical Processor 3 (Hyperthreaded):
------**------------------------------------------------
Physical Processor 4 (Hyperthreaded):
--------**----------------------------------------------
Physical Processor 5 (Hyperthreaded):
----------**--------------------------------------------
Physical Processor 6 (Hyperthreaded):
------------**------------------------------------------
Physical Processor 7 (Hyperthreaded):
--------------**----------------------------------------
Physical Processor 8 (Hyperthreaded):
----------------**--------------------------------------
Physical Processor 9 (Hyperthreaded):
------------------**------------------------------------
Physical Processor 10 (Hyperthreaded):
--------------------**----------------------------------
Physical Processor 11 (Hyperthreaded):
----------------------**--------------------------------
Physical Processor 12 (Hyperthreaded):
------------------------**------------------------------
Physical Processor 13 (Hyperthreaded):
--------------------------**----------------------------
Physical Processor 14 (Hyperthreaded):
----------------------------**--------------------------
Physical Processor 15 (Hyperthreaded):
------------------------------**------------------------
Physical Processor 16 (Hyperthreaded):
--------------------------------**----------------------
Physical Processor 17 (Hyperthreaded):
----------------------------------**--------------------
Physical Processor 18 (Hyperthreaded):
------------------------------------**------------------
Physical Processor 19 (Hyperthreaded):
--------------------------------------**----------------
Physical Processor 20 (Hyperthreaded):
----------------------------------------**--------------
Physical Processor 21 (Hyperthreaded):
------------------------------------------**------------
Physical Processor 22 (Hyperthreaded):
--------------------------------------------**----------
Physical Processor 23 (Hyperthreaded):
----------------------------------------------**--------
Physical Processor 24 (Hyperthreaded):
------------------------------------------------**------
Physical Processor 25 (Hyperthreaded):
--------------------------------------------------**----
Physical Processor 26 (Hyperthreaded):
----------------------------------------------------**--
Physical Processor 27 (Hyperthreaded):
------------------------------------------------------**

Logical Processor to Socket Map:
Socket 0:
****************************----------------------------
Socket 1:
----------------------------****************************

Logical Processor to NUMA Node Map:
NUMA Node 0:
****************************----------------------------
NUMA Node 1:
----------------------------****************************
Calculating Cross-NUMA Node Access Cost...

Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01
00: 1.0 1.1
01: 1.1 1.1

Logical Processor to Cache Map:
Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**------------------------------------------------------
Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**------------------------------------------------------
Unified Cache       0, Level 2,  256 KB, Assoc   8, LineSize  64
**------------------------------------------------------
Unified Cache       1, Level 3,   35 MB, Assoc  20, LineSize  64
****************************----------------------------
Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**----------------------------------------------------
Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**----------------------------------------------------
Unified Cache       2, Level 2,  256 KB, Assoc   8, LineSize  64
--**----------------------------------------------------
Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
----**--------------------------------------------------
Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
----**--------------------------------------------------
Unified Cache       3, Level 2,  256 KB, Assoc   8, LineSize  64
----**--------------------------------------------------
Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
------**------------------------------------------------
Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
------**------------------------------------------------
Unified Cache       4, Level 2,  256 KB, Assoc   8, LineSize  64
------**------------------------------------------------
Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**----------------------------------------------
Instruction Cache   4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**----------------------------------------------
Unified Cache       5, Level 2,  256 KB, Assoc   8, LineSize  64
--------**----------------------------------------------
Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**--------------------------------------------
Instruction Cache   5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**--------------------------------------------
Unified Cache       6, Level 2,  256 KB, Assoc   8, LineSize  64
----------**--------------------------------------------
Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**------------------------------------------
Instruction Cache   6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**------------------------------------------
Unified Cache       7, Level 2,  256 KB, Assoc   8, LineSize  64
------------**------------------------------------------
Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**----------------------------------------
Instruction Cache   7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**----------------------------------------
Unified Cache       8, Level 2,  256 KB, Assoc   8, LineSize  64
--------------**----------------------------------------
Data Cache          8, Level 1,   32 KB, Assoc   8, LineSize  64
----------------**--------------------------------------
Instruction Cache   8, Level 1,   32 KB, Assoc   8, LineSize  64
----------------**--------------------------------------
Unified Cache       9, Level 2,  256 KB, Assoc   8, LineSize  64
----------------**--------------------------------------
Data Cache          9, Level 1,   32 KB, Assoc   8, LineSize  64
------------------**------------------------------------
Instruction Cache   9, Level 1,   32 KB, Assoc   8, LineSize  64
------------------**------------------------------------
Unified Cache      10, Level 2,  256 KB, Assoc   8, LineSize  64
------------------**------------------------------------
Data Cache         10, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------**----------------------------------
Instruction Cache  10, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------**----------------------------------
Unified Cache      11, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------**----------------------------------
Data Cache         11, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------**--------------------------------
Instruction Cache  11, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------**--------------------------------
Unified Cache      12, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------**--------------------------------
Data Cache         12, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------**------------------------------
Instruction Cache  12, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------**------------------------------
Unified Cache      13, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------**------------------------------
Data Cache         13, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------**----------------------------
Instruction Cache  13, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------**----------------------------
Unified Cache      14, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------------**----------------------------
Data Cache         14, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------**--------------------------
Instruction Cache  14, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------**--------------------------
Unified Cache      15, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------------**--------------------------
Unified Cache      16, Level 3,   35 MB, Assoc  20, LineSize  64
----------------------------****************************
Data Cache         15, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------**------------------------
Instruction Cache  15, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------**------------------------
Unified Cache      17, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------------**------------------------
Data Cache         16, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------**----------------------
Instruction Cache  16, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------**----------------------
Unified Cache      18, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------------------**----------------------
Data Cache         17, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------**--------------------
Instruction Cache  17, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------**--------------------
Unified Cache      19, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------------------**--------------------
Data Cache         18, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------**------------------
Instruction Cache  18, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------**------------------
Unified Cache      20, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------------------**------------------
Data Cache         19, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------**----------------
Instruction Cache  19, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------**----------------
Unified Cache      21, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------------------------**----------------
Data Cache         20, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------**--------------
Instruction Cache  20, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------**--------------
Unified Cache      22, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------------------------**--------------
Data Cache         21, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------**------------
Instruction Cache  21, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------**------------
Unified Cache      23, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------------------------**------------
Data Cache         22, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------------**----------
Instruction Cache  22, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------------**----------
Unified Cache      24, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------------------------------**----------
Data Cache         23, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------------**--------
Instruction Cache  23, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------------**--------
Unified Cache      25, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------------------------------**--------
Data Cache         24, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------------**------
Instruction Cache  24, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------------**------
Unified Cache      26, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------------------------------**------
Data Cache         25, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------------------**----
Instruction Cache  25, Level 1,   32 KB, Assoc   8, LineSize  64
--------------------------------------------------**----
Unified Cache      27, Level 2,  256 KB, Assoc   8, LineSize  64
--------------------------------------------------**----
Data Cache         26, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------------------**--
Instruction Cache  26, Level 1,   32 KB, Assoc   8, LineSize  64
----------------------------------------------------**--
Unified Cache      28, Level 2,  256 KB, Assoc   8, LineSize  64
----------------------------------------------------**--
Data Cache         27, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------------------**
Instruction Cache  27, Level 1,   32 KB, Assoc   8, LineSize  64
------------------------------------------------------**
Unified Cache      29, Level 2,  256 KB, Assoc   8, LineSize  64
------------------------------------------------------**

Logical Processor to Group Map:
Group 0:
****************************----------------------------
Group 1:
----------------------------****************************
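
The map rows in the CoreInfo dump above can be turned into bitmasks to count and compare processors per group. This is a minimal sketch in Python, assuming the leftmost character of a row is logical processor 0; `parse_map` is a hypothetical helper, not part of CoreInfo or of my tool.

```python
def parse_map(row: str) -> int:
    """Convert a CoreInfo-style row ('*' = member, '-' = not a member) to a bitmask.

    Assumption: the leftmost character is logical processor 0 (bit 0).
    """
    mask = 0
    for index, ch in enumerate(row.strip()):
        if ch == "*":
            mask |= 1 << index
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r} in map row")
    return mask


# The two rows of the "Logical Processor to Group Map" above.
group0 = parse_map("*" * 28 + "-" * 28)
group1 = parse_map("-" * 28 + "*" * 28)

print(hex(group0))              # 0xfffffff
print(bin(group0).count("1"))   # 28
print(bin(group1).count("1"))   # 28
print(group0 & group1)          # 0
```

These are system-wide bit positions, useful for counting only. Windows affinity masks are expressed relative to a group, so inside group 1 the same 28 processors would again start at bit 0.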

This is the MsInfo32 command dump (information about the server):

OS Name            Microsoft Windows Server 2012 R2 Standard
Version               6.3.9600 Build 9600
Other OS Description    Not Available
OS Manufacturer            Microsoft Corporation
System Name   EMTP6
System Manufacturer   HP
System Model  ProLiant DL360 Gen9
System Type     x64-based PC
System SKU       755258-B21
Processor           Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, 2597 Mhz, 14 Core(s), 28 Logical Processor(s)
Processor           Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, 2597 Mhz, 14 Core(s), 28 Logical Processor(s)
BIOS Version/Date         HP P89, 7/11/2014
SMBIOS Version              2.8
Embedded Controller Version 2.02
BIOS Mode         UEFI
Platform Role   Enterprise Server
Secure Boot State           Off
PCR7 Configuration       Not Available
Windows Directory        ---removed
System Directory            ---removed
Boot Device       \Device\HarddiskVolume2
Locale   United States
Hardware Abstraction Layer      Version = "6.3.9600.17196"
User Name         Not Available
Time Zone          Eastern Standard Time
Installed Physical Memory (RAM)          256 GB
Total Physical Memory 256 GB
Available Physical Memory       246 GB
Total Virtual Memory   294 GB
Available Virtual Memory          283 GB
Page File Space               38.0 GB
Page File             ---removed
Hyper-V - VM Monitor Mode Extensions            Yes
Hyper-V - Second Level Address Translation Extensions             Yes
Hyper-V - Virtualization Enabled in Firmware  Yes
Hyper-V - Data Execution Protection    Yes

This is the screen shot of TaskManager and my program results:

[Screenshot: Task Manager and my program's results]

Or, if Windows decided to start it on node 1:

[Screenshot: Task Manager and my program's results when started on node 1]

Expected behavior from another Server:

OS Name Microsoft Windows Server 2008 HPC Edition
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description    Not Available
OS Manufacturer Microsoft Corporation
System Name COMPUTE-13-6
System Manufacturer HP
System Model    ProLiant DL160 G6
System Type x64-based PC
Processor   Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz, 3068 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor   Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz, 3068 Mhz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date   HP O33, 7/1/2013
SMBIOS Version  2.7
Windows Directory   C:\Windows
System Directory    C:\Windows\system32
Boot Device \Device\HarddiskVolume1
Locale  United States
Hardware Abstraction Layer  Version = "6.1.7601.17514"
User Name   Not Available
Time Zone   Eastern Standard Time
Installed Physical Memory (RAM) 48.0 GB
Total Physical Memory   48.0 GB
Available Physical Memory   40.9 GB
Total Virtual Memory    96.0 GB
Available Virtual Memory    88.4 GB
Page File Space 48.0 GB
Page File   C:\pagefile.sys

[Screenshot: expected behavior on the other server]

Note: I thought we had fixed the problem by changing the "Interleaved Memory" parameter in the BIOS, but it gave us weird results. Following Microsoft TechNet, we set the BIOS setting back to "Non-Interleaved memory" (which is required for the OS to see the system as NUMA).

Henry
Eric Ouellet
  • Have you tried looking at the initial threadpool size? – Yuval Itzchakov Jan 22 '15 at 20:38
  • My crystal ball says that the .config file is not being used. A simple way to check is to intentionally screw up the Version attribute, set it to 4.5.5 :) – Hans Passant Jan 22 '15 at 23:08
  • @Yuval Itzchakov, No, I haven't looked at the initial thread pool size, but if I'm right, it is supposed to balance itself automatically according to the number of jobs it has to do. In this case that is a lot more than 28 (i.e. 80). – Eric Ouellet Jan 23 '15 at 14:20
  • @Hans Passant, Nice try... Result: "This application requires one of the following versions of the .net Framework: .Net Framework, Version=v4.5.5. Do you want to install this .NET Framework version now?". It sounds like the App.config is used. I'm happy to know about that behavior, but it does not solve the problem. Any other ideas? So far it looks to me like a bug in the framework... I'm close to reporting a bug at connect.microsoft.com. What is weird is that it's a pretty new feature, which I would expect to work fine. – Eric Ouellet Jan 23 '15 at 14:28
  • No, Connect will quickly close this with "no repro". This is Microsoft Support material, they'll help you step through troubleshooting this. Reminding you about the details you skipped here, like an actual repro program and stuff like chipset driver versions and whatnot. – Hans Passant Jan 23 '15 at 15:40
  • @Hans, I have a repro program. Do you think the chipset could actually be important in this case? I can't find anything anywhere that suggests I would have to change a driver, an option in the registry (onlyUseOnlyOneGroup or so...), an option in the server, the BIOS firmware, or anything else. If there is something else, then there should exist a way to reach that documentation or, ideally, a tool to diagnose the problem. – Eric Ouellet Jan 23 '15 at 16:16
  • @Hans, I added my code in GitHub. A link is added in question for the code and I will add more info about the server in few seconds. – Eric Ouellet Jan 23 '15 at 19:00
  • Is it possible that because HyperThread cores don't truly equate to full CPU cores and that 28 is the actual number of complete CPU cores you have that's why you're seeing this behaviour? Here's some relevant discussion: https://social.msdn.microsoft.com/Forums/vstudio/en-US/6bd174ea-3a29-441b-a43f-e61d44497029/cpu-usage-in-parallelfor-c?forum=parallelextensions – chsh Jan 26 '15 at 14:29
  • @chsh, HyperThread is not a real logical processor, I agree, but I don't think it is related, for 2 reasons: 1- On my machine (1 physical processor, 6 hyperthreaded cores), my program runs 12 threads simultaneously. 2- On our Windows 2012 server, all threads run on the logical processors of only one processor group. If it were hyperthreading-related, threads would be spread across both processor groups, which is not the case. Additional notes: depending on the type of work, it could be more beneficial not to use hyperthreading. It is also possible that the thread pool manages that aspect, I have no idea – Eric Ouellet Jan 26 '15 at 14:46
  • interesting question. Ping me if you don't get an authoritative answer and I'll post a larger bounty. – jgauffin Jan 30 '15 at 22:45
  • .NET gives priority to the actual physical cores, so all processor groups are engaged, unless you modified process. Unless memory access is involved, hyperthreading doesn't help and essentially wastes CPU cycles. The scheduler likely tries to avoid hyperthreading for this reason. Check [this article](https://dupdob.wordpress.com/2013/03/25/hyperthreading-and-performance/) for a similar situation – Panagiotis Kanavos Feb 02 '15 at 13:54
  • @jgauffin, I ping :-) ??? I contacted HP and I will try on more server to see the behavior today. – Eric Ouellet Feb 02 '15 at 15:16
  • @EricOuellet: Just like you did :) – jgauffin Feb 02 '15 at 19:37
  • note: to capture the current window use `Alt+PrintScreen`. The feature is also available in snipping tool and other 3rd party screenshot utilities – phuclv Mar 10 '18 at 06:12
  • @LưuVĩnhPhúc, Thanks, next time I will take care taking only the window :-) !!! – Eric Ouellet Mar 11 '18 at 23:01

4 Answers

The bug has been fixed by a new HP BIOS (not yet published at the time of writing).

The new BIOS (targeting HP ProLiant DL360 and DL380 Gen9) introduces a new setting, "NUMA Group Size Optimization", with a choice of [Clustered - default] or [Flat]. HP says to set it to Flat.

The screenshot part of this answer was taken on a DL380 instead of a DL360 because of server availability, but I expect the same behavior on the DL360. The problem disappeared: we had only one group.

As far as I know, the OS communicates with the BIOS to learn the CPU configuration. The BIOS plays an important role in how the OS presents the available logical processors to applications (processor group, affinity, etc.).

About the Microsoft documentation "Supporting Systems That Have More Than 64 Processors" and "Processor Groups": it clearly states that more than one processor group is only created when the Logical Processor (LC) count is > 64. On our server (56 LC), with the NUMA architecture set to "Clustered", we had 2 processor groups. A hardware engineer on the HP BIOS dev team explained to me that, when set to "Clustered", the BIOS fools Windows by padding the real number of logical processors up to 72 (the maximum number of logical processors for the E5 v3 family). The real number of LC is 56 in our DL360. That's the reason why we had 2 groups instead of 1. The Microsoft documentation seems accurate. I personally think it would be better to create 1 group per NUMA node whenever possible, but in our case there is a bug. It is hard to say whether HP or Microsoft is at fault when the HP BIOS setting is set to Clustered (the default), but Microsoft does not seem to support that option, which seems to cause our problem.
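
The effect of the padding can be illustrated with a small model. This is a deliberately simplified sketch in Python, assuming the number of groups depends only on the logical processor count the firmware reports and on the 64-processor limit per group. Real Windows also takes NUMA node boundaries into account, and `group_count` is a hypothetical helper, not a Windows API.

```python
import math

MAX_GROUP_SIZE = 64  # a Windows processor group holds at most 64 logical processors


def group_count(reported_logical_processors: int) -> int:
    """Simplified model: groups needed for the count reported by the firmware."""
    return max(1, math.ceil(reported_logical_processors / MAX_GROUP_SIZE))


print(group_count(56))  # 1 (the real count in our DL360)
print(group_count(72))  # 2 (the padded count reported in "Clustered" mode)
```

Under this model, the 2 groups we saw are what the padded count of 72 would produce, not what the real count of 56 would.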

On the HP BIOS for the DL360 and DL380, the "NUMA Configuration" setting set to "Clustered" (the default) creates 2 groups although there are only 56 logical processors (with hyperthreading). The result is that only one processor group is visible at a time to any application, probably because HP fools Windows by padding the number of logical processors with fake ones. It sounds like Microsoft does not expect that. Our C# app can't run on the 2 groups. It's hard to blame Microsoft for that behavior when HP does something they can't anticipate. Perhaps one day we will see Windows supporting multiple groups when LC <= 64.

About Prime95: this CPU stress-test software has good documentation on Wikipedia, which clearly states (in the Limits section) that it will load only one processor group.

Running in Numa Architecture set to Flat

Eric Ouellet

Try building your code with "Optimize code" enabled and the target platform set to "x64". (It worked for me with your code, on a server with 80 cores.)

This is our MsInfo32:

OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Manufacturer IBM
System Model System x3850 X5
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz, 2395 Mhz, 10 Core(s), 20 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz, 2395 Mhz, 10 Core(s), 20 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz, 2395 Mhz, 10 Core(s), 20 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz, 2395 Mhz, 10 Core(s), 20 Logical Processor(s)
BIOS Version/Date IBM Corp. -[G0E179BUS-1.79]-, 28-07-2013
SMBIOS Version 2.7
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer IBM
BaseBoard Model Not Available
BaseBoard Name Base Board
Platform Role Enterprise Server
Secure Boot State Unsupported
PCR7 Configuration Not Available
Hardware Abstraction Layer Version = "6.3.9600.17031"
User Name Not Available
Time Zone Romance Standard Time
Installed Physical Memory (RAM) 128 GB
Total Physical Memory 128 GB
Available Physical Memory 53,0 GB
Total Virtual Memory 147 GB
Available Virtual Memory 67,7 GB
Hyper-V - VM Monitor Mode Extensions Yes
Hyper-V - Second Level Address Translation Extensions Yes
Hyper-V - Virtualization Enabled in Firmware Yes
Hyper-V - Data Execution Protection Yes

enter image description here

Brian
  • Same behavior with the configuration set to "Release - x64" (note: "Optimize code" is enabled by default in Release). I wonder what your configuration is: can you run "MsInfo32" and give me the result? Note: use the menu Edit > Select All and Edit > Copy to put everything on the clipboard. It is really weird that it works on your server and not on ours... – Eric Ouellet Jan 26 '15 at 15:08
  • You "only" have 56 cores, so I do not believe this is related to processor grouping in Windows; for that you would need more than 64 cores. – Brian Jan 27 '15 at 07:30
  • I'm sure we have 2 nodes/processor groups. I agree with you that the documentation is not very clear about the "64". I think it means that if one CPU has more than 64 logical processors, Windows will automatically split them into "virtual" processor groups, but I'm not sure about that. The only thing I'm sure of is that we really have 2 nodes on our Windows 2012 server. – Eric Ouellet Jan 27 '15 at 16:15
  • Thanks a lot for the information (dump of MsInfo32). – Eric Ouellet Jan 27 '15 at 16:16
  • That's really weird, because it works as expected on your server. It means that C# is fine. It must be related to the OS configuration (registry or driver) or the hardware (i.e. BIOS or physical hardware)... ??? – Eric Ouellet Jan 27 '15 at 16:23
  • @EricOuellet The NUMA node count means something completely different from the processor group count. I bet Windows will not split <64 LPs. Please try http://blogs.microsoft.co.il/sasha/2009/07/25/net-support-for-more-than-64-processors/ and post the result. – masaki Jan 28 '15 at 08:41
  • @masaki, I slightly modified my question to be more accurate about node vs. processor group. But Task Manager actually shows a tooltip with "Node", not "ProcessorGroup", while Microsoft uses both with the same meaning. – Eric Ouellet Jan 28 '15 at 15:02
  • @EricOuellet, No, not at all. Microsoft introduced the "Processor Group" because more than 64 LPs breaks the ULONG_PTR limit; this affects the affinity mask, the thread scheduler, etc. But there is a more efficient way if you have <= 64 LPs: identify each LP with one bit. Why don't you share how many processor groups Windows thinks there are according to the API, or whether running the process twice gives different results, ... You can believe there are 2 NUMA nodes (this is true) and 2 processor groups, but I couldn't find any information showing that there are "2 processor groups". Why don't you try GetActiveProcessorGroupCount() ... – masaki Jan 28 '15 at 16:36
  • @Brian, if you change the .config to configuration/runtime/Thread_UseAllCpuGroups@enabled=false, does your machine then use only 20 LPs? – masaki Jan 28 '15 at 16:41
  • @masaki, Thanks! I included the function you recommended to me. I also included a screen capture of my program running together with the task manager with node view. – Eric Ouellet Jan 28 '15 at 18:10
  • @masaki, I just discovered a tool from SysInternals (Microsoft): "CoreInfo". I wonder if that tool could give us a hint about the problem. Do you think you could try it on your server and put the result in your answer? I will do the same for my question and we will be able to compare results. The link is as follows: https://technet.microsoft.com/en-us/sysinternals/cc835722.aspx – Eric Ouellet Jan 30 '15 at 16:59
  • @masaki, I also added two more pieces of information to my little program via GetProcessAffinityMask, and the results seem really interesting: ProcessAffinityMask and SystemAffinityMask. I wonder whether the results are really supposed to be per processor group, because that is the case on our server: both show the first 28 bits set to "1". – Eric Ouellet Jan 30 '15 at 21:47
  • @Brian, I modified my program. I added a call to the Kernel32 function GetProcessAffinityMask, which gives me the system and process affinity. The problem actually seems to be highly related to the fact that those affinity masks show only the number of bits corresponding to one socket (one NUMA node) instead of both. Could you re-run my program and update your screenshot so that we have a better comparison? It would help me with Level 3 HP support. – Eric Ouellet Feb 12 '15 at 19:37
2

You should also make sure there are no Job object restrictions.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms684147(v=vs.85).aspx

You can check this with Process Explorer:

https://technet.microsoft.com/en-us/sysinternals/bb896653.aspx

Calculator with CPU Affinity Limit
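The same check can be roughly sketched in code, assuming a Windows host. `IsProcessInJob` is a kernel32 function (passing a null job handle asks whether the process runs under any job) and `Process.ProcessorAffinity` is the standard .NET property; the class name `AffinityCheck` and its methods are hypothetical names for illustration. Note that being in a job does not by itself prove that an affinity limit is set, so treat the output only as a hint.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class AffinityCheck
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    static extern bool IsProcessInJob(IntPtr processHandle, IntPtr jobHandle, out bool result);

    // Counts the bits set in an affinity mask, i.e. the number of
    // logical processors the mask allows.
    public static int CountLogicalProcessors(long mask)
    {
        ulong bits = unchecked((ulong)mask);
        int count = 0;
        while (bits != 0)
        {
            bits &= bits - 1; // clear the lowest set bit
            count++;
        }
        return count;
    }

    // Hypothetical helper: reports job membership and the affinity mask
    // of the current process.
    public static string Report()
    {
        if (Environment.OSVersion.Platform != PlatformID.Win32NT)
            return "not running on Windows";

        using (Process self = Process.GetCurrentProcess())
        {
            bool inJob;
            if (!IsProcessInJob(self.Handle, IntPtr.Zero, out inJob))
                throw new Win32Exception(Marshal.GetLastWin32Error());

            long mask = self.ProcessorAffinity.ToInt64();
            return string.Format("in job: {0}, affinity mask: 0x{1:X} ({2} logical processors)",
                inJob, mask, CountLogicalProcessors(mask));
        }
    }
}
```

`Console.WriteLine(AffinityCheck.Report());` prints a single line with the job flag, the mask in hexadecimal, and the number of bits set in it.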

masaki
  • And ensure 'System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity' is 63. – masaki Jan 26 '15 at 15:16
  • The limit structure you refer to is pretty deep. I didn't play with that; I let C# manage that aspect and assume that Microsoft did a good job on it. But Microsoft says that setting Thread_UseAllCpuGroups to true (with the proper associated properties) should make the app use all logical processors (of all groups). I think you go a little too deep; I don't think Microsoft expects us to dig that deep. There should be an easy fix (in code or in server configuration), or perhaps there is a bug either at Microsoft or in our server hardware? That's what I'm looking for. – Eric Ouellet Jan 26 '15 at 15:39
  • I still think this is *not* a CPU group problem, because there are only <64 logical processors and I see no evidence of what I was concerned about. Does Windows have an architectural limit above 64 cores? Yes, because ULONG_PTR has at most 64 bits. So they introduced CPU groups. But Eric, you have never shown us that there is more than one CPU group. If you want to stick with CPU groups, try [http://multiproc.codeplex.com/] and [http://blogs.microsoft.co.il/sasha/2009/07/25/net-support-for-more-than-64-processors/]. More information from you would help us answer. – masaki Jan 26 '15 at 18:43
  • I added some information about ProcessorGroup/Node to my question to make it clear that we are using a 2-node machine. I'm not 100% sure, but I strongly think that today's machines with more than 1 CPU are all NUMA and are therefore separated into processor groups, 1 group (node) per CPU. There could also be more than one node per physical CPU when there is more than one die per physical CPU, which is less common these days (I think). – Eric Ouellet Jan 26 '15 at 19:44
  • To know the Processor Group count, you should use GetActiveProcessorGroupCount() in x64 build. [DllImport("kernel32.dll", SetLastError = true)] static extern ushort GetActiveProcessorGroupCount(); – masaki Jan 28 '15 at 15:35
  • I added the process affinity and it shows only the first 28 bits (0-27) or the next 28 (28-55). That's where the problem lies. – Eric Ouellet Feb 02 '15 at 15:13
1

Does this program deadlock or not? I can't tell whether your thread pool is fully expanded.

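using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var threads = 100;
        int workerThreads, completionPortThreads;

        // Set the pool's maximum and minimum worker-thread counts to the same
        // value so that every task below can get a thread right away.
        ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);
        ThreadPool.SetMaxThreads(threads, completionPortThreads);
        ThreadPool.GetMinThreads(out workerThreads, out completionPortThreads);
        ThreadPool.SetMinThreads(threads, completionPortThreads);

        // One signal per task, plus one for the main thread.
        var ce = new CountdownEvent(threads + 1);

        // Each task signals and then blocks until every participant has signaled,
        // so the program finishes only once all tasks are running at the same time.
        Enumerable.
            Range(0, threads).
            Select(xs => Task.Factory.StartNew(() => { ce.Signal(); ce.Wait(); })).
            ToArray();

        // The main thread is the last participant.
        ce.Signal();
        ce.Wait();
    }
}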
masaki
  • I don't understand your point. My program tests with regular Thread objects and also with the thread pool. The problem exists in both cases. If I understand you correctly, you wonder whether, by default, the thread pool limits its number of available threads to the number of logical processors per processor group? What is the purpose of your code? – Eric Ouellet Jan 26 '15 at 14:41
  • I can't tell from your picture how many threads are running, or whether they are raw managed threads or thread-pool threads, so I need to confirm that your code has enough threads running. If the above test runs, the next step is to change 'ce.Wait()' to 'Thread.SpinWait(int.MaxValue)'. If even then you see only CPU #1 being used, I want you to confirm that there is no Job object. – masaki Jan 26 '15 at 15:09
  • FYI: ThreadPool.MaxThreads, workerThreads=32767, completionPortThreads=1000. I don't think it is quite related. – Eric Ouellet Jan 26 '15 at 16:14
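The spin variant suggested in the comments above could look roughly like the following sketch. It is not masaki's exact code: the class name `SpinTest`, the `Run` method, and the bounded spin duration (instead of `Thread.SpinWait(int.MaxValue)`) are assumptions made so that the test terminates on its own. Each task first waits for all the others and then burns CPU, which gives enough time to watch the per-processor load in Task Manager.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class SpinTest
{
    // Hypothetical helper: starts `threads` tasks, waits until all of them
    // run at the same time, lets each one spin for `spinFor`, and returns
    // the number of distinct managed threads that took part.
    public static int Run(int threads, TimeSpan spinFor)
    {
        int worker, io;
        ThreadPool.GetMinThreads(out worker, out io);
        // Raise the minimum so the pool does not throttle thread creation.
        ThreadPool.SetMinThreads(Math.Max(worker, threads), io);

        var ids = new ConcurrentDictionary<int, bool>();
        using (var ce = new CountdownEvent(threads))
        {
            Task[] tasks = Enumerable.Range(0, threads)
                .Select(i => Task.Factory.StartNew(() =>
                {
                    ids[Thread.CurrentThread.ManagedThreadId] = true;
                    ce.Signal();
                    ce.Wait(); // rendezvous: every task holds its own thread here

                    // Busy-wait so the load is visible per logical processor.
                    var sw = Stopwatch.StartNew();
                    while (sw.Elapsed < spinFor)
                        Thread.SpinWait(1000);
                }))
                .ToArray();

            Task.WaitAll(tasks);
        }
        return ids.Count;
    }
}
```

For example, `SpinTest.Run(80, TimeSpan.FromSeconds(30))` keeps 80 threads busy for about 30 seconds. If only the logical processors of one group show load during that time, the threads are confined to that group.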