My overall goal is to run many instances of sklearn's LogisticRegression on the same data in parallel using Python's multiprocessing.Pool. To check the speedup, I compared the running time of the serial code with that of the parallelized code. Here is the common prefix that defines the function running LogisticRegression and creates the data:
def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    lr = LogisticRegression(n_jobs=1)
    lr.fit(X_tr, y_tr)
    res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)  # wall time of one fit + predict
    return res

K = 20
data = np.random.random((100000, 5))
target = np.random.randint(0, 10, (100000))
Here is the simple code using a single thread:
res = []
for i in range(K):
    res.append(f(train_test_split(data.copy(), target.copy(), train_size=0.7)))
print(sum(res) / len(res))
Here is the code using multiprocessing:
with Pool(10) as p:
    res = list(p.map(f, [train_test_split(data.copy(), target.copy(), train_size=0.7) for x in range(K)]))
print(sum(res) / len(res))
This produced weird results that I later traced to numpy using all available threads by default: the multiprocessing version did not run faster than the serial one, and inside it the output of print(time() - t1) was much higher than in the serial run, because the 10 workers each spawned BLAS threads for every core and oversubscribed the CPU. I managed to fix this locally by adding
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before numpy is first imported
However, when I moved the code to Google Colab, I found that the os.environ line does not solve the problem there: I still get the same result as I got on my laptop before adding the os.environ line.
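Since the relevant variable depends on which BLAS backend numpy was built against, it may also be worth checking what Colab's numpy actually uses:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was linked against
# (e.g. OpenBLAS vs. MKL), which determines which *_NUM_THREADS variable applies.
np.show_config()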
I browsed through a dozen numpy-multithreading-related questions, added a couple of other os.environ lines, and got this:
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_MAIN_FREE"] = "1"
But this still had no effect. The final snippet, which works correctly on my laptop but not in Google Colab, is this:
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OPENBLAS_MAIN_FREE"] = "1"

from multiprocessing import Pool, cpu_count
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from time import time
import numpy as np

def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    lr = LogisticRegression(n_jobs=1)
    lr.fit(X_tr, y_tr)
    res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)  # wall time of one fit + predict
    return res

K = 20
data = np.random.random((100000, 5))
target = np.random.randint(0, 10, (100000))

# serial baseline
res = []
for i in range(K):
    res.append(f(train_test_split(data.copy(), target.copy(), train_size=0.7)))
print(sum(res) / len(res))

# multiprocessing version
with Pool(10) as p:
    res = list(p.map(f, [train_test_split(data.copy(), target.copy(), train_size=0.7) for x in range(K)]))
print(sum(res) / len(res))
Does anyone have any idea how to get Google Colab to run numpy without multithreading?
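For reference, the runtime-level workaround I am considering if the environment variables cannot be made to work is threadpoolctl's threadpool_limits, which caps the native thread pools after the libraries are already loaded and therefore does not depend on import order. A minimal sketch of how f would change:

from time import time
from threadpoolctl import threadpool_limits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def f(data):
    t1 = time()
    X_tr, X_te, y_tr, y_te = data
    # Cap every native thread pool (BLAS, OpenMP) at one thread for this block,
    # regardless of what the environment variables say.
    with threadpool_limits(limits=1):
        lr = LogisticRegression(n_jobs=1)
        lr.fit(X_tr, y_tr)
        res = accuracy_score(y_te, lr.predict(X_te))
    print(time() - t1)
    return res

But I would still prefer to understand why the os.environ approach is ignored in Colab.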