
This is my data:

import pandas as pd
import numpy as np

data_preprocessed = pd.read_csv("Absenteeism_preprocessed.csv")
data_preprocessed.head()

   Reason_1  Reason_2  Reason_3  Reason_4  Month  Day of the week  Transportation Expense
0         0         0         0         1      7                1                     289
1         0         0         0         0      7                1                     118
2         0         0         0         1      7                2                     179
3         1         0         0         0      7                3                     279
4         0         0         0         1      7                3                     289

This is only part of my data; sorry, I can't upload all of it.

I have no null values in my data at this point.

I set all values greater than the median (3) to 1 and everything else to 0:

targets = np.where(data_preprocessed['Absenteeism Time in Hours'] >
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
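
A quick sanity check on that claim (assuming the median really is 3, as the sentence above implies; `np.bincount` is just a convenient way to count the 0s and 1s):

print(data_preprocessed['Absenteeism Time in Hours'].median())  # expect 3.0
print(np.bincount(targets))                                     # counts of 0s and 1s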

data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours',
                                            'Day of the week',
                                            'Distance to Work',
                                            'Daily Work Load Average'], axis=1)

The size of my data:

data_with_targets.shape  # (700, 12)

My inputs:

unscaled_inputs = data_with_targets.iloc[:, :-1]

The order in which the columns are stored:

order = unscaled_inputs.columns.values
order

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

Then I performed the train/test split:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(unscaled_inputs, targets,
                                                    train_size=0.8,
                                                    random_state=50)
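
One detail worth keeping in mind for later: `train_test_split` shuffles the rows but keeps each row's original label, so `x_train` carries a scrambled subset of the labels 0-699 rather than a fresh 0-559 range. A quick way to see this (the printed labels below are illustrative, not from my run):

print(x_train.index[:5])                      # e.g. Int64Index([412, 37, 655, 91, 208], ...)
print(x_train.index.is_monotonic_increasing)  # False after shuffling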

Then I split `x_train` into two DataFrames, dummies in one and numeric values in the other, so that I could scale only the numeric values:

new_unscaled_inputs = x_train.loc[:, "Month":"Body Mass Index"]  # numeric columns, part 1
new_unscaled_inputs_2 = x_train.loc[:, "Children":"Pets"]        # numeric columns, part 2
dummy_1 = x_train.loc[:, "Reason_1":"Reason_4"]                  # dummy columns
dummy_2 = x_train.loc[:, "Education"]                            # single dummy column (a Series)

Concatenate both numeric blocks back together:

new_unscaled_var = pd.concat([new_unscaled_inputs,new_unscaled_inputs_2],axis=1)
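
This concat is safe: both pieces were sliced from the same `x_train`, so their row indexes are identical and `pd.concat` lines them up one-to-one. A quick check:

assert new_unscaled_inputs.index.equals(new_unscaled_inputs_2.index)  # same labels, same order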

No null values:

new_unscaled_var.isnull().sum()
Month                     0
Transportation Expense    0
Age                       0
Body Mass Index           0
Children                  0
Pets                      0
dtype: int64

I scaled the numeric values:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_new = scaler.fit_transform(new_unscaled_var)  # returns a plain NumPy array
scaled_new

# NOTE: building a DataFrame from a NumPy array gives it a brand-new
# RangeIndex 0..559 -- the shuffled x_train labels are gone at this point.
scaled_df = pd.DataFrame(scaled_new, columns=['Month', 'Transportation Expense',
                                              'Age', 'Body Mass Index',
                                              'Children', 'Pets'])
scaled_df.isnull().sum()

No null values:

Month                     0
Transportation Expense    0
Age                       0
Body Mass Index           0
Children                  0
Pets                      0
dtype: int64
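
Since the comment in the code above flagged that `scaled_df` lost `x_train`'s labels, here is a sketch of a variant that would keep them (`scaled_df_keep_idx` is an illustrative name, not part of my original code):

scaled_df_keep_idx = pd.DataFrame(scaled_new,
                                  columns=new_unscaled_var.columns,
                                  index=new_unscaled_var.index)  # reuse the shuffled x_train labels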

Now I concatenated the dummy columns into one DataFrame:

dummy_df = pd.concat([dummy_1, dummy_2],axis=1)

Still no null values:

dummy_df.isnull().sum()

Reason_1     0
Reason_2     0
Reason_3     0
Reason_4     0
Education    0
dtype: int64
dummy_df.shape  # (560, 5)

Now, when I concatenate the `scaled_df` columns and the `dummy_df` columns, I get 111 null values in every single column:

scaled_inputs = pd.concat([scaled_df, dummy_df], axis=1)
scaled_inputs.isnull().sum()

Month                     111
Transportation Expense    111
Age                       111
Body Mass Index           111
Children                  111
Pets                      111
Reason_1                  111
Reason_2                  111
Reason_3                  111
Reason_4                  111
Education                 111
dtype: int64

I don't understand why. Please help me understand this.

    What's `print(scaled_df.index.difference(dummy_df.index).empty)`? Is it `False`? Then there's an index mismatch between the two. If you'd like to ignore that, you can go for `scaled_inputs = pd.concat([scaled_df, dummy_df], axis=1, ignore_index=True)`. –  Jan 07 '22 at 07:58
  • Thanks for the suggestion; what you said is true, the indexes didn't match. `ignore_index=True` didn't work for me, though. I found a workaround on a different thread that said to use `reset_index` before concatenating the two DataFrames; I'll post it below. Thanks!!!! – EON Jan 07 '22 at 09:52

1 Answer


I got the answer from another Stack Overflow thread: use `reset_index(drop=True)` on both DataFrames before concatenating. Link here

scaled_df = scaled_df.reset_index(drop=True)  # assign back, or the reset is discarded
scaled_df.head()

      Month  Transportation Expense       Age  Body Mass Index  Children      Pets
0  0.454628               -0.996453 -1.131382        -1.084160 -0.925174 -0.595121
1 -1.261719               -0.666968 -0.977228        -1.771995 -0.925174 -0.595121
2  0.740685                0.171723  1.026777         2.584296 -0.016231 -0.595121
3  1.598859                0.366419  1.643394         1.208625  0.892711  0.223719
4 -0.403546               -1.026407 -0.360611        -0.396324  0.892711 -0.595121

dummy_df = dummy_df.reset_index(drop=True)  # same here
dummy_df.head()

    Reason_1    Reason_2    Reason_3    Reason_4    Education
0      0           0           1           0            0
1      0           0           0           1            1
2      0           0           0           0            0
3      0           0           0           1            0
4      0           0           0           1            0

It worked phenomenally here!

scaled_inputs = pd.concat([scaled_df, dummy_df], axis=1)
scaled_inputs.isnull().sum()

Month                     0
Transportation Expense    0
Age                       0
Body Mass Index           0
Children                  0
Pets                      0
Reason_1                  0
Reason_2                  0
Reason_3                  0
Reason_4                  0
Education                 0
dtype: int64
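
For anyone wondering why this fixes it: `pd.concat(..., axis=1)` aligns rows by index label, taking the union of both indexes and filling unmatched labels with NaN. Here `scaled_df` had a fresh 0-559 RangeIndex (it was built from a NumPy array), while `dummy_df` still carried the shuffled labels from `x_train`, so most labels existed on only one side. A minimal toy sketch (made-up numbers, not the original data):

import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0, 3.0]})                # fresh index: 0, 1, 2
b = pd.DataFrame({'y': [10, 20, 30]}, index=[5, 0, 2])  # shuffled "train" labels

print(pd.concat([a, b], axis=1))
#      x     y
# 0  1.0  20.0
# 1  2.0   NaN   <- label 1 exists only in a
# 2  3.0  30.0
# 5  NaN  10.0   <- label 5 exists only in b

# Resetting both indexes makes rows pair up by position instead:
print(pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1))

This also explains why `ignore_index=True` didn't help: with `axis=1` it resets the labels along the concatenation axis, i.e. the column names, not the row index. An alternative to resetting afterwards is to keep `x_train`'s index on the scaled frame in the first place, e.g. by passing `index=new_unscaled_var.index` when building `scaled_df`.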