
Consider the (3, 13) NumPy array built below:

import numpy as np
import pandas as pd
from scipy.stats import linregress

a = [-0.00845,-0.00568,-0.01286,-0.01302,-0.02212,-0.01501,-0.02132,-0.00783,-0.00942,0.00158,-0.00016,0.01422,0.01241]
b = [0.00115,0.00623,0.00160,0.00660,0.00951,0.01258,0.00787,0.01854,0.01462,0.01479,0.00980,0.00607,-0.00106]
c = [-0.00233,-0.00467,0.00000,0.00000,-0.00952,-0.00949,-0.00958,-0.01696,-0.02212,-0.01006,-0.00270,0.00763,0.01005]
array = np.array([a, b, c])
yvalues = pd.to_datetime(['2019-12-15','2019-12-16','2019-12-17','2019-12-18','2019-12-19','2019-12-22','2019-12-23','2019-12-24',
                          '2019-12-25','2019-12-26','2019-12-29','2019-12-30','2019-12-31'], errors='coerce')

I can successfully run the OLS regression on one row of the array at a time, as shown below:

out = linregress(array[0], y=yvalues.to_julian_date())
print(out)
LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252878, stderr=105.71465449878443)

However, what I wish to accomplish is to run the regression on the whole array in one go, with the 'y' variable (yvalues) held constant for every row (a loop is a possible solution, but tiresome). I tried to extend yvalues to match the array's shape with np.tile, but that does not seem to be the right approach. Thank you all for your help.

Siraj S.
  • Does this answer your question? [Multi-variable linear regression with scipy linregress](https://stackoverflow.com/questions/37985759/multi-variable-linear-regression-with-scipy-linregress) – Jacques Gaudin Mar 25 '20 at 12:39

1 Answer


IIUC, you are looking for something like the following list comprehension, but done in a vectorized way:

out = [linregress(array[i], y=yvalues.to_julian_date()) for i in range(array.shape[0])]

out
[LinregressResult(slope=329.141087037396, intercept=2458842.411731361, rvalue=0.684426534581417, pvalue=0.009863937200252876, stderr=105.71465449878443),
 LinregressResult(slope=178.44888292241782, intercept=2458838.7056912296, rvalue=0.1911788042719021, pvalue=0.5315353013148307, stderr=276.24376878908953),
 LinregressResult(slope=106.86168938856262, intercept=2458840.7656617565, rvalue=0.17721031419860186, pvalue=0.5624701260912525, stderr=178.940293876864)]
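If you stick with the list comprehension, the per-row results can then be unpacked into plain arrays for further use (a small sketch, assuming the `out` list from above; the attribute names come from LinregressResult):

# each element of `out` is a LinregressResult
slopes = np.array([r.slope for r in out])
intercepts = np.array([r.intercept for r in out])
rvalues = np.array([r.rvalue for r in out])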

To be honest, I've never seen what you are looking for implemented in scipy or statsmodels.

Therefore, we can implement it ourselves using numpy broadcasting:

x = array
y = np.array(yvalues.to_julian_date())

# means of the inputs (one per row) and of the output
x_mean = np.mean(x, axis=1)
y_mean = np.mean(y)

# least-squares slope and intercept for each row, via broadcasting
num = np.sum((x - x_mean[:, np.newaxis]) * (y - y_mean), axis=1)
den = np.sum((x - x_mean[:, np.newaxis]) ** 2, axis=1)

slopes = num / den
intercepts = y_mean - slopes * x_mean
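For reference, these are the standard closed-form least-squares estimates, evaluated here once per row:

$$\text{slope} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$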

slopes
array([329.14108704, 178.44888292, 106.86168939])

intercepts
array([2458842.41173136, 2458838.70569123, 2458840.76566176])
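As a quick sanity check (a sketch, assuming the `out` list from the loop-based version is still in scope), the broadcasted estimates agree with the per-row linregress results:

assert np.allclose(slopes, [r.slope for r in out])
assert np.allclose(intercepts, [r.intercept for r in out])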
FBruzzesi