
I've trained a model using PySpark and would like to compare its performance to that of an existing heuristic.

I just want to hardcode an LR model with the coefficients 0.1, 0.5, and 0.7, call .transform on the test data to get the predictions, and compute the accuracies.

How do I hardcode a model?

aabb
  • If you know the coefficients, why not just plug them into the [standard logistic function](https://en.wikipedia.org/wiki/Logistic_regression#Logistic_function,_odds,_odds_ratio,_and_logit)? Essentially compute `p(x) = 1/(1 + exp(-(B0 + B1*X1 + B2*X2 + ... + Bn*Xn)))`, which is the probability of class `1`. If `p(x) > 0.5`, pick class 1, else class 0. The `B`'s are the coefficients (`B0` is the intercept). – pault Jun 25 '18 at 14:04
  • @pault Ah yes, the fallback option would be to manually compute the predictions and calculate its accuracy. I was first hoping for a way to utilize the library to do so :) – aabb Jun 25 '18 at 17:58
  • Not sure if there's a built in method for this. You may have to define your own [custom Transformer](https://stackoverflow.com/questions/49734374/pyspark-ml-pipelines-are-custom-transformers-necessary-for-basic-preprocessing). – pault Jun 25 '18 at 21:05

2 Answers


Unfortunately it's not possible to just set the coefficients of a pyspark LR model. The pyspark LR model is actually a wrapper around a java ml model (see class JavaEstimator).

So when the LR model is fit, it transfers the params from the paramMap to a new java estimator, which is fit to the data. All the LogisticRegressionModel methods/attributes are just calls to the java model using the _call_java method.

Since the coefficients aren't params (you can see a comprehensive list using explainParams on a LR instance), you can't pass them to the java LR model that's created, and there is no setter method.
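A quick way to confirm this yourself is to print the param list on a fresh LR instance; the coefficients don't show up in it:

    from pyspark.ml.classification import LogisticRegression

    # lists every settable param (maxIter, regParam, threshold, ...),
    # but nothing for the coefficients themselves
    print(LogisticRegression().explainParams())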

For example, for a logistic regression model lmr, you can see that the only setters are for the params you can set when you instantiate a pyspark LR instance: lowerBoundsOnCoefficients and upperBoundsOnCoefficients.

    print([c for c in lmr._java_obj.__dir__() if "coefficient" in c.lower()])
    # >>> ['coefficientMatrix', 'lowerBoundsOnCoefficients',
    # 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$lowerBoundsOnCoefficients_$eq',
    # 'getLowerBoundsOnCoefficients',
    # 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$upperBoundsOnCoefficients_$eq',
    # 'getUpperBoundsOnCoefficients', 'upperBoundsOnCoefficients', 'coefficients',
    # 'org$apache$spark$ml$classification$LogisticRegressionModel$$_coefficients']

Trying to set the "coefficients" attribute yields this:

    print(lmr.coefficients)
    # >>> DenseVector([18.9303, -18.9303])
    lmr.coefficients = [10, -10]
    # >>> AttributeError: can't set attribute

So you'd have to roll your own pyspark transformer if you want to be able to provide coefficients. It would probably be easier just to calculate results using the standard logistic function as per @pault's comment.
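For completeness, here's a minimal sketch of that manual approach. The feature column names x1, x2, x3, the zero intercept, the label column, and the test DataFrame name are assumptions for illustration, not from the question:

    from pyspark.sql import functions as F

    # hardcoded coefficients; intercept assumed to be 0 here
    intercept = 0.0
    coefs = {"x1": 0.1, "x2": 0.5, "x3": 0.7}

    # z = B0 + B1*X1 + B2*X2 + B3*X3
    z = F.lit(intercept)
    for name, b in coefs.items():
        z = z + F.lit(b) * F.col(name)

    # p(x) = 1 / (1 + exp(-z)); predict class 1 when p(x) > 0.5
    scored = (test
              .withColumn("probability", 1.0 / (1.0 + F.exp(-z)))
              .withColumn("prediction", (F.col("probability") > 0.5).cast("double")))

    accuracy = scored.filter(F.col("prediction") == F.col("label")).count() / scored.count()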

kshell

You can set lower and upper bounds on the coefficients of an LR model. In your case, since you know exactly which coefficients you want, set the lower and upper bounds to the same values and the fitted model will end up with exactly those coefficients. You can specify the bounds as dense matrices like this:

    from pyspark.ml.linalg import Matrices

    # one row (binomial model) by three columns (one per feature)
    a = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
    b = Matrices.dense(1, 3, [0.1, 0.5, 0.7])

and incorporate them into the model as hyperparameters:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10,
                            lowerBoundsOnCoefficients=a,
                            upperBoundsOnCoefficients=b,
                            threshold=0.5)

and voila! you have your model.

You can then call fit & transform on your model:

    best_mod = lr.fit(train)

    predict_train = best_mod.transform(train)  # train data
    predict_test = best_mod.transform(test)    # test data
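
To get the accuracy the question asks for, one option (assuming the label column is named label) is the built-in evaluator; you can also sanity-check that the fitted coefficients were pinned to the bounds:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # should match the bounds, i.e. [0.1, 0.5, 0.7]
    print(best_mod.coefficients)

    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")
    print(evaluator.evaluate(predict_test))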
Nidhi