I want to do a multiple linear regression in PySpark, where
- y = Bx = b1 * x1 + b2 * x2 + b3 * x3 (assume the number of features is 3);
- the sum of the weights in vector B equals 1;
- each weight is non-negative and no greater than 1.
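In other words, I want to solve the constrained least-squares (quadratic programming) problem: minimize ||b1*x1 + b2*x2 + b3*x3 - y||^2 subject to b1 + b2 + b3 = 1 and 0 <= bi <= 1 for each i.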
I learned how to do this in Python with scipy.optimize.minimize through these two questions: constrained linear regression / quadratic programming python and Simple linear regression with constraint.
I also know that in Spark, simple linear regression can be done with pyspark.ml.regression.LinearRegression, but I couldn't find any parameters there that can be tweaked to meet these constraints.
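For context, here is a minimal sketch of the plain, unconstrained fit with pyspark.ml on the same toy data as the scipy example further below (the column names and session setup are just for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(5.2, 4.0, 5.4, 5.3),
     (5.3, 5.0, 6.2, 4.9),
     (4.2, 6.0, 4.9, 5.6),
     (3.9, 3.1, 4.7, 4.1)],
    ["x1", "x2", "x3", "y"])

# assemble the three features into a single vector column, as pyspark.ml expects
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "y")

# fitIntercept, regParam and elasticNetParam are the knobs I see here;
# none of them seem to enforce sum-to-one or box constraints on the coefficients
lr = LinearRegression(featuresCol="features", labelCol="y", fitIntercept=False)
model = lr.fit(train)
print(model.coefficients)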
I tried creating a UDF in Spark and applying the scipy method from Python (following this post: Pyspark dataframe: how to apply scipy.optimize function by group). Leaving aside the slow performance, the weights also turned out to be wrong.
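For illustration, here is a minimal sketch of how that UDF-per-group approach can be wired up (not my exact code; it assumes PySpark >= 3.0 for applyInPandas and a DataFrame with a grouping column plus y, x1, x2, x3; the per-group fit reuses the standalone scipy setup shown further below):

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(pd.DataFrame({
    "group": ["a"] * 4,
    "y":  [5.3, 4.9, 5.6, 4.1],
    "x1": [5.2, 5.3, 4.2, 3.9],
    "x2": [4.0, 5.0, 6.0, 3.1],
    "x3": [5.4, 6.2, 4.9, 4.7]}))

def fit_constrained(pdf):
    # same SLSQP setup as the standalone scipy example, applied to one group's rows
    m = pdf[["x1", "x2", "x3"]].to_numpy().T
    y = pdf["y"].to_numpy()
    cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0})
    bnds = [(0, 1) for _ in range(m.shape[0])]
    res = minimize(lambda x: np.sum(np.square(np.dot(x, m) - y)),
                   np.zeros(m.shape[0]), method='SLSQP',
                   constraints=cons, bounds=bnds)
    return pd.DataFrame([[pdf["group"].iloc[0]] + list(res.x)],
                        columns=["group", "b1", "b2", "b3"])

weights = (sdf.groupBy("group")
              .applyInPandas(fit_constrained,
                             schema="group string, b1 double, b2 double, b3 double"))
weights.show()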
Below is a simplified example for testing:
import numpy as np
from scipy.optimize import minimize

# toy data: three features, four observations
x1 = np.array([5.2, 5.3, 4.2, 3.9])
x2 = np.array([4.0, 5.0, 6.0, 3.1])
x3 = np.array([5.4, 6.2, 4.9, 4.7])
m = np.vstack([x1, x2, x3])
y = np.array([5.3, 4.9, 5.6, 4.1])

startval = np.zeros(m.shape[0])
# equality constraint: the weights must sum to 1
cons = ({'type': 'eq',
         'fun': lambda x: np.sum(x) - 1.0})
# each weight is bounded between 0 and 1
bnds = [(0, 1) for i in range(m.shape[0])]

def loss_ols(x):
    # squared L2 norm of the residuals
    return np.sum(np.square(np.linalg.norm(np.dot(x, m) - y, 2)))

res = minimize(loss_ols, startval, method='SLSQP', constraints=cons,
               bounds=bnds, options={'disp': True})
print('Weights: {}.'.format(np.round(res.x, 2)))
The resulting weights are [0.19 0.43 0.38], but the result from my Spark code is [0.9999999999999977, 1.2212453270876722E-15, 1.2212453270876722E-15].
I think there must be something wrong with my Spark code, so I would like to hear your opinions. My questions are:
- Is there any function in the Spark ML library where the constraints and bounds can be specified?
- If not, could you please share a faster and correct way of mapping the minimize function?