I'm working with scikit-learn and have written a function that loops over several parameter values of an elastic net regression. My features are in a 500000 x 1100000 sparse matrix, so a single iteration takes up to an hour. I have therefore been looking into multiprocessing and into parallel computing with ipyparallel, but I do not really understand the concepts or how to apply them to my function. Which approach (multiprocessing, ipyparallel, or something else) would you recommend for my setup? Can you give me a first idea of how to apply it?

This is my function applying the pipelined elasticnet regression:

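from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def do_penalized_regression(x, y, alphavalue, lambdavalue):
    # l1_ratio is the L1/L2 mixing parameter; n_alphas is the number of penalty strengths tried by CV
    enr = ElasticNetCV(n_alphas=lambdavalue, l1_ratio=alphavalue)
    pipeliner = make_pipeline(StandardScaler(with_mean=False), enr)
    pipeliner.fit(x, y)
    return enr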

And this is the function that iterates over different alpha values and returns the model fit indices:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

d = {}
lambdavalue = 100

def alpha_lambda_evaluation(x):
    # Y (the target vector) is a global defined elsewhere in the script
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, Y, test_size=.2, random_state=42)

    for alphavalue in np.arange(0.05, 1.0, 0.05):
        enr = do_penalized_regression(Xtrain, Ytrain, alphavalue, lambdavalue)
        rmse = get_rmse(enr, Xtest, Ytest)
        aic = get_aic(enr, Xtest, Ytest)
        bic = get_bic(enr, Xtest, Ytest)
        rsquared = get_rsquared(enr, Xtest, Ytest)
        F = get_F_value_test(enr, Xtest, Ytest)
        # format the key to two decimals so float noise from np.arange stays out of the label
        d[f'alpha_{alphavalue:.2f}'] = [rmse, aic, bic, rsquared, F]
    return d

evaluation = pd.DataFrame(d)
evaluation = evaluation.rename({0: "rmse", 1: "aic", 2: "bic", 3: "rsquared", 4: "F"})
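
Each alpha value in the loop above is fitted independently of the others, so the loop body is the natural unit to hand to separate workers. Below is a minimal sketch of that idea using only the standard library's `concurrent.futures`. The names `fit_and_score` and `evaluate_alphas` are hypothetical and not part of the code above; `fit_and_score` returns dummy numbers and merely stands in for one call to `do_penalized_regression` plus the `get_*` helpers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fit_and_score(alphavalue):
    """Hypothetical stand-in for one iteration of the loop.

    In the real script this would fit the model for one alpha value and
    compute the fit indices; here it returns dummy numbers so the sketch
    runs on its own.
    """
    rmse = 1.0 - alphavalue      # placeholder value, not a real metric
    rsquared = alphavalue        # placeholder value, not a real metric
    return [rmse, rsquared]


def evaluate_alphas(alphavalues, worker=fit_and_score,
                    executor_cls=ProcessPoolExecutor, max_workers=2):
    """Run worker(alpha) for every alpha in parallel and collect the results."""
    alphavalues = list(alphavalues)  # allow generators and NumPy arrays
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs
        results = list(executor.map(worker, alphavalues))
    # one dictionary entry per alpha, keyed like the original loop
    return {f"alpha_{a:.2f}": r for a, r in zip(alphavalues, results)}
```

Calling `evaluate_alphas([0.25, 0.5])` from a script (under an `if __name__ == "__main__":` guard) would run the two evaluations in separate processes; passing `executor_cls=ThreadPoolExecutor` is a convenient way to dry-run the same code without starting new processes. With a feature matrix of this size, every worker process may need its own copy of the data, so available memory is likely to limit how many workers are practical. As an alternative that needs no extra code, `ElasticNetCV` itself accepts a list of values for `l1_ratio` and an `n_jobs` argument.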

I'm pretty new to Python and to programming, so any suggestions and tips on how to improve my code are welcome!

cian