I want to do a multiple linear regression in PySpark, where
- y = Bx = b1 * x1 + b2 * x2 + b3 * x3 (assume the number of features is 3);
- the sum of the weights in vector B equals 1;
- each weight is non-negative and no greater than 1.
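In other words, I want to solve the constrained least-squares (quadratic programming) problem: minimize ||b1*x1 + b2*x2 + b3*x3 - y||^2 subject to b1 + b2 + b3 = 1 and 0 <= bi <= 1 for each i.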
I learned how to do this in Python with scipy.optimize.minimize through these two questions: constrained linear regression / quadratic programming python and Simple linear regression with constraint.
I also know that in Spark, simple linear regression can be done with pyspark.ml.regression.LinearRegression, but I couldn't find any parameters there that can be tweaked to meet these constraints.
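For context, here is a minimal sketch of the plain, unconstrained fit with pyspark.ml on the same toy data as the scipy example further below (the column names and session setup are just for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(5.2, 4.0, 5.4, 5.3),
     (5.3, 5.0, 6.2, 4.9),
     (4.2, 6.0, 4.9, 5.6),
     (3.9, 3.1, 4.7, 4.1)],
    ["x1", "x2", "x3", "y"])

# assemble the three features into a single vector column, as pyspark.ml expects
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
train = assembler.transform(df).select("features", "y")

# fitIntercept, regParam and elasticNetParam are the knobs I see here;
# none of them seem to enforce sum-to-one or box constraints on the coefficients
lr = LinearRegression(featuresCol="features", labelCol="y", fitIntercept=False)
model = lr.fit(train)
print(model.coefficients)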
I tried creating a UDF in Spark and applying the scipy method from Python (following this post: Pyspark dataframe: how to apply scipy.optimize function by group). Leaving aside the slow performance, the weights also turned out to be wrong.
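For illustration, here is a minimal sketch of how that UDF-per-group approach can be wired up (not my exact code; it assumes PySpark >= 3.0 for applyInPandas and a DataFrame with a grouping column plus y, x1, x2, x3; the per-group fit reuses the standalone scipy setup shown further below):

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(pd.DataFrame({
    "group": ["a"] * 4,
    "y":  [5.3, 4.9, 5.6, 4.1],
    "x1": [5.2, 5.3, 4.2, 3.9],
    "x2": [4.0, 5.0, 6.0, 3.1],
    "x3": [5.4, 6.2, 4.9, 4.7]}))

def fit_constrained(pdf):
    # same SLSQP setup as the standalone scipy example, applied to one group's rows
    m = pdf[["x1", "x2", "x3"]].to_numpy().T
    y = pdf["y"].to_numpy()
    cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0})
    bnds = [(0, 1) for _ in range(m.shape[0])]
    res = minimize(lambda x: np.sum(np.square(np.dot(x, m) - y)),
                   np.zeros(m.shape[0]), method='SLSQP',
                   constraints=cons, bounds=bnds)
    return pd.DataFrame([[pdf["group"].iloc[0]] + list(res.x)],
                        columns=["group", "b1", "b2", "b3"])

weights = (sdf.groupBy("group")
              .applyInPandas(fit_constrained,
                             schema="group string, b1 double, b2 double, b3 double"))
weights.show()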
Below is a simplified example for testing:
import numpy as np
from scipy.optimize import minimize

# toy data: three features, four observations
x1 = np.array([5.2, 5.3, 4.2, 3.9])
x2 = np.array([4.0, 5.0, 6.0, 3.1])
x3 = np.array([5.4, 6.2, 4.9, 4.7])
m = np.vstack([x1, x2, x3])
y = np.array([5.3, 4.9, 5.6, 4.1])

startval = np.zeros(m.shape[0])
# equality constraint: the weights must sum to 1
cons = ({'type': 'eq',
         'fun': lambda x: np.sum(x) - 1.0})
# each weight is bounded between 0 and 1
bnds = [(0, 1) for i in range(m.shape[0])]

def loss_ols(x):
    # squared L2 norm of the residuals
    return np.sum(np.square(np.linalg.norm(np.dot(x, m) - y, 2)))

res = minimize(loss_ols, startval, method='SLSQP', constraints=cons,
               bounds=bnds, options={'disp': True})
print('Weights: {}.'.format(np.round(res.x, 2)))
The resulting weights are [0.19 0.43 0.38], but the result from my Spark code is [0.9999999999999977, 1.2212453270876722E-15, 1.2212453270876722E-15].
I think there must be something wrong with my Spark code, so I would like to hear your opinions. My questions are:
- Is there any function in the Spark ML library where the constraints and bounds can be specified?
- If not, could you please share a faster and correct way of mapping the minimize function?