This is my data:
import pandas as pd
import numpy as np

data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv")
data_preprocessed.head()
   Reason_1  Reason_2  Reason_3  Reason_4  Month  Day of the week  Transportation Expense
0         0         0         0         1      7                1                     289
1         0         0         0         0      7                1                     118
2         0         0         0         1      7                2                     179
3         1         0         0         0      7                3                     279
4         0         0         0         1      7                3                     289
This is only half my data; sorry, I can't upload all of it.
At this point there are no null values in my data.
I set all the values above the median (3) to 1 and the rest to 0:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
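To double-check the cutoff, here is a quick sanity check (the 3.0 is what I expect based on my ">3" rule above):

# Confirm the median used as the cutoff, and how balanced the targets are
print(data_preprocessed['Absenteeism Time in Hours'].median())  # expecting 3.0
print(targets.sum(), len(targets))  # count of 1s vs. total rows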
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours',
                                            'Day of the week',
                                            'Distance to Work',
                                            'Daily Work Load Average'], axis=1)
Size of my data:
data_with_targets.shape  # (700, 12)
My inputs:
unscaled_inputs = data_with_targets.iloc[:, :-1]
The order in which the columns are stored:
order = unscaled_inputs.columns.values
array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month',
'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
'Children', 'Pets'], dtype=object)
Then I performed a train/test split:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(unscaled_inputs, targets,
                                                    train_size=0.8,
                                                    random_state=50)
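For reference, an 80/20 split of 700 rows should give 560 training rows and 140 test rows (a quick shape check):

# Sanity check on the split sizes
print(x_train.shape, x_test.shape)  # (560, 11) (140, 11)
print(y_train.shape, y_test.shape)  # (560,) (140,)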
Then I split my data into two DataFrames, dummies in one and numeric values in the other, so that I can scale only the numeric values:
new_unscaled_inputs = x_train.loc[:,"Month":"Body Mass Index"]
new_unscaled_inputs_2 = x_train.loc[:,"Children":"Pets"]
dummy_1 = x_train.loc[:,"Reason_1":"Reason_4"]
dummy_2 = x_train.loc[:,"Education"]
Concatenate both sets of numeric columns:
new_unscaled_var = pd.concat([new_unscaled_inputs,new_unscaled_inputs_2],axis=1)
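Just to confirm the numeric frame looks right before scaling (the column list and the (560, 6) shape are what I expect here):

print(new_unscaled_var.columns.tolist())
# ['Month', 'Transportation Expense', 'Age', 'Body Mass Index', 'Children', 'Pets']
print(new_unscaled_var.shape)  # (560, 6)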
No null values:
new_unscaled_var.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
Then I scaled the numeric values:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_new = scaler.fit_transform(new_unscaled_var)
scaled_new
scaled_df = pd.DataFrame(scaled_new, columns=['Month', 'Transportation Expense',
                                              'Age', 'Body Mass Index',
                                              'Children', 'Pets'])
Still no null values:
scaled_df.isnull().sum()
Month 0
Transportation Expense 0
Age 0
Body Mass Index 0
Children 0
Pets 0
dtype: int64
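As a quick sanity check on the scaling itself (StandardScaler standardizes with the population std, hence the ddof=0 below):

# Each scaled column should have ~0 mean and unit std
print(np.allclose(scaled_df.mean(), 0))       # True
print(np.allclose(scaled_df.std(ddof=0), 1))  # True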
Now I concatenated the dummy columns into one DataFrame:
dummy_df = pd.concat([dummy_1, dummy_2],axis=1)
Still no null values:
dummy_df.isnull().sum()
Reason_1 0
Reason_2 0
Reason_3 0
Reason_4 0
Education 0
dtype: int64
Shape of the dummy DataFrame:
dummy_df.shape  # (560, 5)
Now, when I concatenate scaled_df and dummy_df, I get 111 null values in every single column:
scaled_inputs = pd.concat([scaled_df, dummy_df],axis = 1)
scaled_inputs.isnull().sum()
Month 111
Transportation Expense 111
Age 111
Body Mass Index 111
Children 111
Pets 111
Reason_1 111
Reason_2 111
Reason_3 111
Reason_4 111
Education 111
dtype: int64
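For debugging, here is a check I could run right before the concat (a sketch; pd.concat(axis=1) aligns rows by their index labels, so comparing the two indices might reveal a mismatch):

# Do the two frames share the same row index?
# scaled_df was rebuilt from a NumPy array (fresh 0..559 RangeIndex),
# while dummy_df still carries x_train's shuffled index from train_test_split.
print(scaled_df.index[:10])
print(dummy_df.index[:10])
print(scaled_df.index.equals(dummy_df.index))  # False would explain the NaNs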
I don't understand why. Please help me understand this.