
I am running a linear regression model and would like to log the coefficient and p-value of each variable, together with the variable name, as metrics in MLflow. I am new to MLflow and not sure how to do this. Below is part of my code:

with mlflow.start_run(run_name=p_key + '_' + str(o_key)):
    
    lr = LinearRegression(
      featuresCol = 'features',
      labelCol = target_var,
      maxIter = 10,
      regParam = 0.0,
      elasticNetParam = 0.0,
      solver="normal"
        )
    
    lr_model_item = lr.fit(train_model_data)
    lr_coefficients_item = lr_model_item.coefficients
    lr_coefficients_intercept = lr_model_item.intercept
    
    lr_predictions_item = lr_model_item.transform(train_model_data)
    lr_predictions_item_oos = lr_model_item.transform(test_model_data)
    
    rsquared = lr_model_item.summary.r2
    
    # Log mlflow attributes for mlflow UI
    mlflow.log_metric("rsquared", rsquared)
    mlflow.log_metric("intercept", lr_coefficients_intercept)
    # Iterating over the vector yields values, not indices, so enumerate it
    for i, coefficient in enumerate(lr_coefficients_item):
      mlflow.log_metric(f'coefficient_{i}', coefficient)

Is this possible? The final output should contain the intercept, the coefficients, the p-values and the corresponding variable names.

Gun

1 Answer

If I understand you correctly, you want to log the p-value and coefficient for each variable name separately in MLflow. The difficulty with Spark ML is that all columns are generally merged into a single "features" column before being passed to an estimator (e.g. LinearRegression). As a result, one loses track of which name belongs to which position in the feature vector.

We can get the names of every feature in the "features" column from your linear model by defining the following function [1]:

from itertools import chain

def feature_names(model, df):
  features_dict = df.schema[model.summary.featuresCol].metadata["ml_attr"]["attrs"].values()
  return sorted([(attr["idx"], attr["name"]) for attr in chain(*features_dict)])

The above function returns a list of tuples sorted by index: the first entry of each tuple is the index of the feature in the "features" column, and the second entry is the name of that feature. It relies on the "ml_attr" metadata of the "features" column, so the DataFrame passed in must carry that metadata.
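To make the shape of that metadata concrete, here is a minimal sketch that runs without Spark. The dictionary `example_metadata` and the feature names in it are made up for illustration, and `feature_names_from_metadata` is a hypothetical variant of the function above that takes the metadata dictionary directly instead of a model and a DataFrame.

```python
from itertools import chain

# Hypothetical metadata, shaped like the "ml_attr" entry attached to an
# assembled "features" column: attributes are grouped by type, and each
# attribute carries its index in the vector and its original name.
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "binary": [{"idx": 1, "name": "is_member"}],
        },
        "num_attrs": 3,
    }
}


def feature_names_from_metadata(metadata):
    """Return (index, name) tuples sorted by position in the feature vector."""
    groups = metadata["ml_attr"]["attrs"].values()
    # chain(*groups) flattens the per-type lists into one stream of attributes
    return sorted((attr["idx"], attr["name"]) for attr in chain(*groups))


pairs = feature_names_from_metadata(example_metadata)
print(pairs)  # [(0, 'age'), (1, 'is_member'), (2, 'income')]
```

Sorting matters because the attributes are grouped by type, so their order in the metadata need not match their order in the vector.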

With this function, we can match each feature name to its position in the "features" column, and therefore log the coefficient and p-value per feature.

def has_pvalue(model):
  ''' Check whether p-values are available for the given model '''
  try:
    # Accessing pValues raises if the model configuration does not provide them
    model.summary.pValues
    return True
  except Exception:
    return False


import mlflow
from pyspark.ml.regression import LinearRegression

with mlflow.start_run():
  lr = LinearRegression(
    featuresCol="features",
    labelCol="label",
    maxIter = 10,
    regParam = 1.0,
    elasticNetParam = 0.0,
    solver = "normal"
  )
  lr_model = lr.fit(train_data)

  mlflow.log_metric("rsquared", lr_model.summary.r2)
  mlflow.log_metric("intercept", lr_model.intercept)
  
  for index, name in feature_names(lr_model, train_data):
    mlflow.log_metric(f"Coef. {name}", lr_model.coefficients[index])
    if has_pvalue(lr_model):
      # P-values are not always available. This depends on the model configuration.
      mlflow.log_metric(f"P-val. {name}", lr_model.summary.pValues[index])
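As an alternative to one `log_metric` call per value, the metrics can be collected in a dictionary first and logged in one go with `mlflow.log_metrics`. The sketch below only builds the dictionary, so it runs without Spark or MLflow. The helpers `metric_key` and `build_metrics`, the key prefixes and all numbers are hypothetical. Two assumptions are worth checking against your versions: that MLflow metric names are limited to alphanumerics, underscores, dashes, periods, spaces and slashes, and that Spark returns the intercept's p-value as the last element of `pValues` when `fitIntercept` is true.

```python
import re


def metric_key(prefix, name):
    """Build a metric key, replacing characters MLflow may reject with '_'."""
    # Assumption: only alphanumerics, '_', '-', '.', ' ' and '/' are allowed
    safe = re.sub(r"[^A-Za-z0-9_\-. /]", "_", name)
    return f"{prefix}_{safe}"


def build_metrics(names, coefficients, intercept, p_values=None, fit_intercept=True):
    """Collect the intercept, coefficients and optional p-values in one dict."""
    metrics = {"intercept": float(intercept)}
    for index, name in names:
        metrics[metric_key("coef", name)] = float(coefficients[index])
        if p_values is not None:
            metrics[metric_key("pval", name)] = float(p_values[index])
    if p_values is not None and fit_intercept:
        # Assumption: the last p-value belongs to the intercept
        metrics["pval_intercept"] = float(p_values[-1])
    return metrics


# Made-up values standing in for feature_names(...), lr_model.coefficients,
# lr_model.intercept and lr_model.summary.pValues
names = [(0, "age"), (1, "income (k$)")]
metrics = build_metrics(names, [0.5, -1.25], 2.0, p_values=[0.01, 0.2, 0.001])

# With MLflow installed and a run active: mlflow.log_metrics(metrics)
for key, value in metrics.items():
    print(key, value)
```

Sanitizing the names guards against feature names that contain characters such as parentheses, which could otherwise make the logging call fail partway through the loop.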

[1]: Related Stack Overflow question

Bram