I'm working with scikit-learn and I have written a function that loops over several parameters of an elastic net regression. My features are in a 500000 x 1100000 sparse matrix, so a single iteration takes up to an hour. I have therefore been looking into multiprocessing as well as parallel computing with ipyparallel, but I don't really understand the concepts or how to apply them to my function. Which approach (multiprocessing, ipyparallel, or something else?) would you recommend for my setup? Can you give me a first idea of how to apply it?
This is my function that fits the pipelined elastic net regression:
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def do_penalized_regression(x, y, alphavalue, lambdavalue):
    # lambdavalue = number of penalty values tried by the CV, alphavalue = L1/L2 mixing ratio
    enr = ElasticNetCV(n_alphas=lambdavalue, l1_ratio=alphavalue, normalize=False)
    pipeliner = make_pipeline(StandardScaler(with_mean=False), enr)
    pipeliner.fit(x, y)
    return enr
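One thing I noticed while reading the documentation is that ElasticNetCV itself accepts an n_jobs argument for its internal cross-validation, so maybe part of the work could already be parallelized there? This is just a guess at how I would use it (untested):

    # untested idea: let the cross-validation inside ElasticNetCV use all available cores
    enr = ElasticNetCV(n_alphas=lambdavalue, l1_ratio=alphavalue, n_jobs=-1)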
And this is the function that iterates over different alpha parameters and returns the model fit indices:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

d = {}
lambdavalue = 100

def alpha_lambda_evaluation(x):
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, Y, test_size=0.2, random_state=42)
    for alphavalue in np.arange(0.05, 1.0, 0.05):
        enr = do_penalized_regression(Xtrain, Ytrain, alphavalue, lambdavalue)
        rmse = get_rmse(enr, Xtest, Ytest)
        aic = get_aic(enr, Xtest, Ytest)
        bic = get_bic(enr, Xtest, Ytest)
        rsquared = get_rsquared(enr, Xtest, Ytest)
        F = get_F_value_test(enr, Xtest, Ytest)
        # collect all fit indices for this alpha value
        d[f'alpha_{alphavalue}'] = [rmse, aic, bic, rsquared, F]
    return d

evaluation = pd.DataFrame(d)
evaluation = evaluation.rename(index={0: "rmse", 1: "aic", 2: "bic", 3: "rsquared", 4: "F"})
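From what I read, scikit-learn uses joblib internally, so I also tried to sketch how the alpha loop could be run in parallel with joblib.Parallel. This is only a rough, untested attempt (it reuses my do_penalized_regression and the get_* helpers from above), and I have no idea whether shipping the huge sparse matrix to every worker would eat up the speed gain:

    from joblib import Parallel, delayed

    def evaluate_single_alpha(alphavalue, Xtrain, Xtest, Ytrain, Ytest):
        # fit one model for a single alpha and return its fit indices
        enr = do_penalized_regression(Xtrain, Ytrain, alphavalue, lambdavalue)
        scores = [get_rmse(enr, Xtest, Ytest), get_aic(enr, Xtest, Ytest),
                  get_bic(enr, Xtest, Ytest), get_rsquared(enr, Xtest, Ytest),
                  get_F_value_test(enr, Xtest, Ytest)]
        return alphavalue, scores

    def alpha_lambda_evaluation_parallel(x):
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, Y, test_size=0.2, random_state=42)
        # one task per alpha value; n_jobs=-1 should use all available cores
        results = Parallel(n_jobs=-1)(
            delayed(evaluate_single_alpha)(a, Xtrain, Xtest, Ytrain, Ytest)
            for a in np.arange(0.05, 1.0, 0.05))
        return {f'alpha_{a}': scores for a, scores in results}

Would something along these lines be the right direction, or is ipyparallel a better fit here?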
I'm pretty new to Python and to programming, so any suggestions and tips on how to improve my code are welcome!