I need some help with logistic regression. Below is my data:
ID | Mach_1 | Mach_2 | Mach_3 | Mach_4 | Mach_5 | ... | Mach300 | Rejected Unit (%) | Yield(%)
127189.11 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0.23 | 98.0%
178390.11 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0.10 | 90.0%
902817.11 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0.60 | 94.0%
DSK1201.11 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0.02 | 99.98%
I have about 300 machine columns and 2K rows. For each machine, I want to estimate how much it contributes to the rejected-unit percentage, so that I can tell which machine is responsible for the rejected units.
I have written some code, but I get an error that I don't understand and don't know how to fix. Below is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('Data.csv')
#Convert ID into numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df['ID'])
#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']
#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train,X_train)
res = model.fit()
print(res.summary())
This was my first attempt, and it raises the following error:
ValueError: endog must be in the unit interval
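The message reads as if `sm.Logit` expects every value of the target to lie between 0 and 1. A minimal sketch of how that could be checked follows. It uses a toy frame whose values are invented for illustration (they are not the real data), with only the column name `Rejected Unit (%)` taken from above:

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Mach_1": [1, 0, 1, 1],
    "Mach_2": [0, 0, 0, 1],
    "Rejected Unit (%)": [0.23, 0.10, 1.60, 0.02],
})

target = toy["Rejected Unit (%)"]

# Boolean mask of rows whose target lies outside the closed interval [0, 1].
outside = ~target.between(0, 1)

print("min:", target.min(), "max:", target.max())
print("rows outside [0, 1]:", int(outside.sum()))
print(toy.loc[outside])
```

Running the same check on the real `y` should show whether any rows fall outside the unit interval.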
I then scaled my y (the target variable), but now I get a different error, and I don't know why it occurs or how to fix it.
This is my latest code, after scaling the data:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('Data.csv')
#Convert ID into numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df['ID'])
#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']
#scale target variable
from sklearn.preprocessing import MinMaxScaler
y_reshape = y.values.reshape(-1,1)
scaler = MinMaxScaler()
y_scale = scaler.fit_transform(y_reshape)
#change the numpy array of y_scale into dataframe
y = pd.DataFrame(y_scale)
#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train,X_train)
res = model.fit()
print(res.summary())
Then I get this error:
Can anyone help me with this?