
My overall goal is to run many instances of sklearn's LogisticRegression on the same data in parallel using Python's multiprocessing.Pool. To measure the speedup, I compared the running time of the serial code against the parallel code. Here is the common prefix, which defines the function that runs LogisticRegression and creates the data:

from multiprocessing import Pool
from time import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    lr = LogisticRegression(n_jobs=1)  # keep sklearn itself single-threaded
    lr.fit(X_tr, y_tr)
    res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)  # time to fit and score one model
    return res

K = 20  # number of models to train

data = np.random.random((100000, 5))
target = np.random.randint(0, 10, (100000))

Here is the simple code using a single thread:

res = []
for i in range(K):
    res.append(f(train_test_split(data.copy(), target.copy(), train_size=0.7)))
print(sum(res) / len(res))

Here is the code using multiprocessing:

with Pool(10) as p:
    res = list(p.map(f, [train_test_split(data.copy(), target.copy(), train_size=0.7) for _ in range(K)]))
print(sum(res) / len(res))

This produced weird results, which I later explained by numpy using all available threads by default: the multiprocessing version did not run faster than the serial one, and the per-call output of print(time() - t1) was much higher in the workers than in the serial run. I managed to fix this locally by adding

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

However, when I moved the code to Google Colab, I found that the os.environ line does not solve the problem there: I get the same result as I do on my laptop without the os.environ line.
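As a sanity check, the thread limits that are actually in effect can be inspected with threadpoolctl (a sketch of my own, assuming the package is installed; sklearn already depends on it):

from threadpoolctl import threadpool_info

# Each entry describes one loaded thread pool (OpenBLAS, OpenMP, MKL, ...)
# together with the number of threads it is currently allowed to use.
for pool in threadpool_info():
    print(pool["user_api"], pool["filepath"], pool["num_threads"])

If OPENBLAS_NUM_THREADS took effect, the OpenBLAS entry should report num_threads equal to 1.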

I browsed through a dozen numpy-multithreading-related questions, added a couple of other os.environ lines, and ended up with this:

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1" 
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_MAIN_FREE"] = "1"

But this still had no effect. The final snippet, which works correctly on my laptop but not in Google Colab, is this (note that the os.environ lines come before every other import, so the variables are already set when numpy loads its BLAS):

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_MAIN_FREE"] = "1"

from multiprocessing import Pool, cpu_count
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from time import time
import numpy as np

def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    lr = LogisticRegression(n_jobs=1)
    lr.fit(X_tr, y_tr)
    res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)
    return res

K = 20

data = np.random.random((100000, 5))
target = np.random.randint(0, 10, (100000))

res = []
for i in range(K):
    res.append(f(train_test_split(data.copy(), target.copy(), train_size=0.7)))
print(sum(res) / len(res))

with Pool(10) as p:
    res = list(p.map(f, [train_test_split(data.copy(), target.copy(), train_size=0.7) for _ in range(K)]))
print(sum(res) / len(res))
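
For completeness, threadpoolctl also has a runtime API, threadpool_limits, that caps the thread pools of libraries that are already loaded, which might sidestep the environment-variable timing problem entirely. A sketch of f rewritten this way (untested on Colab):

from threadpoolctl import threadpool_limits

def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    # Limit BLAS/OpenMP pools to one thread for everything in this block,
    # even if the *_NUM_THREADS environment variables were ignored.
    with threadpool_limits(limits=1):
        lr = LogisticRegression(n_jobs=1)
        lr.fit(X_tr, y_tr)
        res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)
    return res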

Does anyone have any idea how to fix this and get Google Colab to run numpy without multithreading?
