
I need some help on Logistic Regression. Below is my data:

ID         | Mach_1 | Mach_2 | Mach_3 | Mach_4 | Mach_5 | ... Mach300 | Rejected Unit (%) | Yield (%)
127189.11  |   1    |   0    |   1    |   1    |   1    |      0      |       0.23        |  98.0%
178390.11  |   0    |   0    |   0    |   1    |   0    |      0      |       0.10        |  90.0%
902817.11  |   1    |   0    |   1    |   0    |   1    |      0      |       0.60        |  94.0%
DSK1201.11 |   1    |   0    |   0    |   0    |   1    |      0      |       0.02        |  99.98%

I have about 300 machine columns and 2,000 rows. I want to estimate, for each machine, how much it contributes to the rejected-unit percentage, so that I can identify which machines are driving the rejects.

I have written some code, but I am facing errors that I don't understand and don't know how to solve. Below is my code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('Data.csv')

#Convert ID into numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df['ID'])

#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']

#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train, X_train)
res = model.fit()
print(res.summary())

This was my first attempt, and it raises an error:

ValueError: endog must be in the unit interval
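
As the comments below point out, this error occurs because `sm.Logit` requires the response (`endog`) to lie in the unit interval [0, 1]; since the column is in percent, dividing by 100 keeps its meaning, whereas MinMax scaling distorts it. A minimal sketch with made-up values (chosen so that at least one exceeds 1, which is what triggers the error):

```python
import pandas as pd

# Made-up rejected-unit percentages standing in for the real column
y_pct = pd.Series([0.23, 2.50, 0.60, 98.0], name='Rejected Unit (%)')

# Dividing by 100 maps percentages into the [0, 1] interval
# that sm.Logit expects for its endog argument
y = y_pct / 100.0
print(y.between(0, 1).all())  # True
```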

Then I scaled my y (target variable), and now I am getting another error that I don't understand and don't know how to solve.

This is my latest code, after scaling the data:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = pd.read_csv('Data.csv')

#Convert ID into numerical
le = LabelEncoder()
df['ID'] = le.fit_transform(df['ID'])

#Separate target variable and other columns
X = df.drop('Rejected Unit (%)', axis=1)
y = df['Rejected Unit (%)']

#scale target variable
from sklearn.preprocessing import MinMaxScaler
y_reshape = y.values.reshape(-1,1)
scaler = MinMaxScaler()
y_scale = scaler.fit_transform(y_reshape)
#change the numpy array of y_scale into a dataframe
y = pd.DataFrame(y_scale)


#Split data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
#Get the coefficient for each features column
import statsmodels.api as sm
model = sm.Logit(y_train, X_train)
res = model.fit()
print(res.summary())

Then I get this error:

(The traceback was posted as a screenshot; per the comments below, it is most likely a singular-matrix error raised during `Logit.fit`.)

Can anyone help me with this?

  • If your `Rejected Unit (%)` values are in percent, then you just need to divide them by 100 to get fractions. – Josef Jan 16 '20 at 16:12
  • The second exception means most likely that your `X_train` is singular. You can check with `np.linalg.matrix_rank(X_train)`. – Josef Jan 16 '20 at 16:14
  • @Josef when I run `np.linalg.matrix_rank(X_train)`, the output is 291. Does that mean it is a singular matrix? – NAJAA BAZILAH Jan 17 '20 at 00:55
  • If 291 is smaller than the number of columns, then it is singular. In the description you only mentioned around 300 columns. – Josef Jan 17 '20 at 01:43
  • @Josef Yes, the data only has 300 columns. How can I solve this singular-matrix issue so that I can run the code? – NAJAA BAZILAH Jan 17 '20 at 02:05
  • First, check whether the full data is also singular, or if it is just a consequence of the train-test split. If the full data is singular, then you have to find and drop collinear columns. One possibility is to use variance_inflation_factor to check for multicollinearity https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python – Josef Jan 17 '20 at 03:42
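
Following the comments: a rank of 291 against 300 columns means 9 columns are linear combinations of the others, so the matrix is singular. A small sketch with a made-up toy matrix, showing the rank check and one simple (greedy) way to find a set of columns to keep; `variance_inflation_factor` from statsmodels, linked in the comment above, is an alternative diagnostic:

```python
import numpy as np
import pandas as pd

# Toy design matrix with a deliberately collinear column:
# Mach_3 is exactly Mach_1 + Mach_2, so the matrix is singular
X = pd.DataFrame({
    'Mach_1': [1, 0, 1, 1, 0],
    'Mach_2': [0, 1, 1, 0, 0],
    'Mach_3': [1, 1, 2, 1, 0],
})

# Rank below the number of columns means the matrix is singular
rank = np.linalg.matrix_rank(X.values)
print(rank, X.shape[1])  # 2 3

# Greedily keep only columns that increase the rank;
# the dropped columns are linear combinations of the kept ones
keep = []
for col in X.columns:
    if np.linalg.matrix_rank(X[keep + [col]].values) == len(keep) + 1:
        keep.append(col)
print(keep)  # ['Mach_1', 'Mach_2']
```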

0 Answers