How can I find the respective P-values for a multiple linear regression using the linear model from sklearn?

Question

I'm trying to develop a machine learning model for multiple linear regression that predicts Y given n X variables. So far, my model can read in a data set and give the predicted value, the coefficient of determination, and the coefficient for a 1-unit increase in each X. The only issues are:

I can't get the p-values for the life of me. Most of the time I get an error saying the data isn't shaped right, because it has 5 columns and 1329 rows. When I do get an output, the values are incorrect; I know because I ran the same regression with the Analysis ToolPak in Excel.
Is there a way to make the model recursive, so that it finds the highest p-value above .05 and calls itself again without that variable until it hits the base case? That would be something like:

while dependent_v[pvalue] > .05:

Also, what would be the best visualization method to show my data?

Thank you to anyone who helps. I'm just starting to delve into machine learning on my own and want to hone my skills before an upcoming data science internship in the summer.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model


def multipleReg():
    dfreg = pd.read_csv("dfreg.csv")

    # Predictor columns (the X variables)
    dependent_v = ['Large_size', 'Mid_Level', 'Senior_Level', 'Exec_Level', 'Company_Location']
    # Target column (the Y variable)
    independent_v = 'Salary_In_USD'

    X = dfreg[dependent_v]    # Drawing the X columns from the dataframe
    y = dfreg[independent_v]  # Drawing the Y column from the dataframe

    reg = linear_model.LinearRegression()  # Initializing regression model
    reg.fit(X.values, y)                   # Fitting the model

    # Prediction for a single row, passed as a 2-D array
    predicted_sal = reg.predict([[1, 0, 1, 0, 0]])

    # Model coefficient of determination, as a percentage
    percent_rscore = reg.score(X.values, y) * 100

    print('\n')
    print("The predicted salary is:", predicted_sal)
    print("The coefficient of determination is:", "{:,.2f}%".format(percent_rscore))

    # Printing the coefficients (how much Y increases due to a
    # 1-unit increase in each X)
    print("The corresponding coefficients for the X variables are:", reg.coef_)

score 1 · Answer 1 · answered Nov 29 '22 at 18:07

As far as I know, sklearn doesn't return p-values; it is better to use the statsmodels library.

But if you need to use sklearn anyway, you can find various solutions here:

Find p-value (significance) in scikit-learn LinearRegression
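For reference, here is a rough sketch of one way to compute the p-values by hand from a fitted sklearn `LinearRegression`, using the usual OLS formulas (standard errors from the diagonal of sigma^2 (X'X)^-1, then two-sided t-tests). It is not taken from the linked page. It assumes the model was fit with an intercept, that no columns are perfectly collinear (otherwise X'X cannot be inverted, which can happen with a full set of dummy variables), and it uses made-up synthetic data:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, illustration only): 200 rows, 3 predictors
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

reg = LinearRegression().fit(X, y)

# Design matrix with an explicit intercept column, matching fit_intercept=True
X_design = np.column_stack([np.ones(n), X])
beta = np.concatenate([[reg.intercept_], reg.coef_])

residuals = y - reg.predict(X)
dof = n - X_design.shape[1]            # residual degrees of freedom: n - (p + 1)
sigma2 = residuals @ residuals / dof   # unbiased estimate of the error variance

# Covariance of the coefficient estimates: sigma^2 * (X'X)^-1
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)
std_err = np.sqrt(np.diag(cov))

t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values

print("p-values (intercept first):", p_values)
```

The first entry corresponds to the intercept and the rest follow the column order of `X`, so they line up with `reg.coef_`.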

Alvaricoque
