
I am working on a dual-processor Windows machine and am trying to run several independent Python processes using the multiprocessing library. Of course, I am aiming to maximize the use of both CPUs in order to speed up computation time. The details of my machine are below:

  • OS: Windows 10 Pro for Workstations
  • RAM: 524 GB
  • Hard Drive: Samsung SSD PRO 960 (NVMe)
  • CPU: Xeon Gold 6154 (times 2)

I execute a master script using Python 3.6, which then spawns 72 memory-independent workers using the multiprocessing library. Initially, all 72 logical cores of my machine are used at 100%. After about 5-10 minutes, however, all 36 logical cores on my second CPU drop to 0% usage, while the 36 on the first CPU remain at 100%. I can't figure out why this is happening.

Is there something I am missing regarding the utilization of both CPUs in a dual-processor Windows machine? How can I ensure that the full potential of my machine is utilized? As a side note, I'm curious whether this would be different on a Linux OS. Thank you in advance to anyone who is willing to help with this.

A representation of my Python master script is below:

import pandas as pd
import netCDF4 as nc
from multiprocessing import Pool

WEATHERDATAPATH = "C:/Users/..../weatherdata/weatherfile_%s.nc4"
OUTPUTPATH = "C:/Users/....outputs/result_%s.csv"

def calculationFunction(year):
    dataset = nc.Dataset(WEATHERDATAPATH%year)

    # Read the data
    data1 = dataset["windspeed"][:]
    data2 = dataset["pressure"][:]
    data3 = dataset["temperature"][:]

    timeindex = nc.num2date(dataset["time"][:], dataset["time"].units)

    # Do computations with the data, primarily relying on NumPy
    data1Mean = data1.mean(axis=1)
    data2Mean = data2.mean(axis=1)
    data3Mean = data3.mean(axis=1)

    # Write result to a file
    result = pd.DataFrame( {"windspeed":data1Mean,
                            "pressure":data2Mean,
                            "temperature":data3Mean,}, 
                          index=timeindex )
    result.to_csv(OUTPUTPATH%year)

if __name__ == '__main__':
    pool = Pool(72)

    results = []
    for year in range(1900, 2016):
        results.append(pool.apply_async(calculationFunction, (year,)))

    for r in results:
        r.get()  # re-raises any exception that occurred in a worker

    pool.close()
    pool.join()
  • As a follow-up, I have done further testing which did not resolve the situation. I limited the master script to 36 cores and executed it twice in parallel. After launching the first instance, I waited about 10 minutes, until it appeared that all 36 cores on the first CPU were being consumed. Then I launched the second instance and, once again, all 36 cores on the second CPU were active for 5-10 minutes before that CPU went silent. I even tried executing a third instance of the script (meaning 108 simultaneous worker processes should be generated), but observed the same result – Severin May 12 '18 at 13:28
  • One more follow-up. I asked another user of the machine to submit their own parallel task. They do not run the same simulations I do; instead they submitted an optimization job using the commercial software Gurobi, instructed to use 36 cores. Once their task was running (and fluctuating around full utilization of the first CPU), I executed my master script (limited to 36 cores). Once again, the same result is observed: both CPUs are used completely for a time, until the second goes silent while the first remains at 100% usage – Severin May 12 '18 at 13:42
    Make sure you're using 64-bit Python, since 32-bit processes are limited to 32 cores in a single processor group. 64-bit Windows manages logical cores in groups of up to 64 cores, taking locality into account. You probably have two groups, one for each CPU, each with 36 cores. By default, threads in a process are assigned to the same processor group. Other groups can be used either by manually creating threads with a different group affinity (`CreateRemoteThreadEx`, setting the group affinity in the `lpAttributeList`), or manually switching a thread's group affinity (`SetThreadGroupAffinity`). – Eryk Sun May 12 '18 at 14:35
  • Thank you for the tip. I am indeed using the 64-bit version of Python. I tried looking into your suggestion about the processor group, and I do not see an option in python's multiprocessing module to designate a processor group. Maybe I am missing something? On the other hand, my understanding of the multiprocessing library is that it spawns independent processes on the OS, and is therefore distinct from multithreading. So even if there are two processor groups, as you suggest, shouldn't the multiple processes already be able to utilize them both? – Severin May 12 '18 at 14:59
  • Ideally, when creating the 72 processes in the pool, Windows should distribute the processes evenly across processor groups. Of course, that only yields a balanced load if CPU-bound work is evenly distributed over all processes in the pool. It also assumes that processes aren't assigned to a Job object that limits them to a subset of available groups via the `JobObjectGroupInformation` limit. – Eryk Sun May 12 '18 at 15:29
  • The limiting Job may have been inherited from the main process if the Job isn't configured to allow silent breakaway. In that case it may allow explicit breakaway via the `CREATE_BREAKAWAY_FROM_JOB` creation flag of `CreateProcess`, but multiprocessing doesn't support passing a custom `creationflags` to its internal `subprocess.Popen` call. – Eryk Sun May 12 '18 at 15:34
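For reference, the processor-group APIs mentioned in the comments above can be reached from Python via ctypes. The sketch below is only an illustration of those calls (the helper name pin_current_thread_to_group is made up here; nothing like it exists in multiprocessing): it reports how many groups Windows sees and pins the calling thread to all logical processors of one group.

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

class GROUP_AFFINITY(ctypes.Structure):
    # Mirrors the Win32 GROUP_AFFINITY structure (KAFFINITY is pointer-sized).
    _fields_ = [("Mask", ctypes.c_size_t),
                ("Group", wintypes.WORD),
                ("Reserved", wintypes.WORD * 3)]

kernel32.GetActiveProcessorGroupCount.restype = wintypes.WORD
kernel32.GetActiveProcessorCount.argtypes = (wintypes.WORD,)
kernel32.GetActiveProcessorCount.restype = wintypes.DWORD
kernel32.GetCurrentThread.restype = wintypes.HANDLE
kernel32.SetThreadGroupAffinity.argtypes = (wintypes.HANDLE,
                                            ctypes.POINTER(GROUP_AFFINITY),
                                            ctypes.POINTER(GROUP_AFFINITY))
kernel32.SetThreadGroupAffinity.restype = wintypes.BOOL

def pin_current_thread_to_group(group):
    """Illustrative helper: bind the calling thread to every logical
    processor of the given processor group."""
    cpus = kernel32.GetActiveProcessorCount(group)
    new = GROUP_AFFINITY(Mask=(1 << cpus) - 1, Group=group)
    old = GROUP_AFFINITY()
    if not kernel32.SetThreadGroupAffinity(kernel32.GetCurrentThread(),
                                           ctypes.byref(new), ctypes.byref(old)):
        raise ctypes.WinError(ctypes.get_last_error())

if __name__ == "__main__":
    groups = kernel32.GetActiveProcessorGroupCount()
    print("processor groups:", groups)
    for g in range(groups):
        print("logical CPUs in group", g, ":", kernel32.GetActiveProcessorCount(g))

In principle each worker could call such a helper at start-up (for example via the Pool's initializer argument) to spread itself across groups, but that was not attempted in this question.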

1 Answer

It turns out the issue was with NumPy. As this solution explains, NumPy and several other similar packages rely on a BLAS library for their numerical operations. To boost performance, the BLAS library uses its own multithreading and may also set CPU affinity for its threads, and on Windows an affinity mask only ever covers a single processor group. This caused the NumPy work in my workers (which in my original code doesn't begin until partway through, as indicated above) to be forced onto the first CPU.
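As a quick check, NumPy can report which BLAS/LAPACK it was built against, which tells you whose settings actually matter here:

import numpy as np

# Prints the BLAS/LAPACK build configuration of this NumPy installation
# (e.g. OpenBLAS or MKL), so you know which library's settings apply.
np.show_config()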

The solution is to stop the BLAS library from pinning CPU affinity on its own (and, if needed, to rein in its internal multithreading). I'm not sure if this impacts performance, but in this case I think it will be okay. Luckily this is easy to do: I only had to set a single environment variable, which I did at the very top of my Python script, before NumPy (which pandas and netCDF4 pull in) gets imported:

import os
os.environ["OPENBLAS_MAIN_FREE"] = "1"

Now the machine runs at full capacity throughout the entire run :)
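As a further note (I did not need this on my machine, so treat it as an untested sketch): if your NumPy is linked against a different BLAS, the usual way to rein in its threading is the corresponding thread-count environment variable, again set before NumPy is imported:

import os

# Which variable applies depends on the BLAS your NumPy build links against;
# setting all three is harmless. They must be set before NumPy is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
os.environ["OMP_NUM_THREADS"] = "1"       # generic OpenMP-based builds

import numpy as np  # only import NumPy after the variables are in place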

Severin