When I looked at a similar question related to my problem, I found a solution, but I needed to change the condition as shown below:
def clean_variables(data):
    for column in cat_others:
        data.loc[(data[column].value_counts()/data[column].count())*100 < 0.01, cat_others] = "Rare"
    return data

X_train = clean_variables(X_train)
However, I get an indexing error, shown below:
(Missing False
Grvl False
Pave False
Name: Alley, dtype: bool, ['Alley', 'RoofStyle', 'LandContour', 'HouseStyle', 'BldgType'])
I tried '.iloc' instead of '.loc', but I still cannot get it to work. The list of column names shown in the output is the list of columns in cat_others, so the operation should be applied to those columns, but something is wrong with the condition inside the loop.
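To illustrate what I think the error output is showing, here is a minimal sketch on made-up data (the `toy` frame and the 0.2 threshold are only for illustration, not my real X_train). The condition built from value_counts() seems to be indexed by category (Missing, Grvl, Pave), not by row, so it has one entry per category rather than one per row:

```python
import pandas as pd

# Hypothetical toy data standing in for one column of X_train.
toy = pd.DataFrame({"Alley": ["Missing"] * 6 + ["Grvl"] * 3 + ["Pave"]})

# Share of each category; the result is indexed by category value.
freq = toy["Alley"].value_counts() / toy["Alley"].count()
category_mask = freq < 0.2

# Mapping each row's category to its share gives a mask aligned with the rows.
row_mask = toy["Alley"].map(freq) < 0.2

print(list(category_mask.index))          # category labels, not row labels
print(len(category_mask), len(row_mask))  # one entry per category vs. one per row
```

Only the row-aligned mask has the same index as the DataFrame, which is presumably what '.loc' needs as a row selector.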
Thanks for your help.
def rare_replacement(data):
    for col in cat_others:
        v_counts = data[col].value_counts()
        rare_vals = v_counts[(v_counts/v_counts.sum()) < 0.01]
        rare_vals_list = list(rare_vals.index)
        for i,v in enumerate(data[col]):
            if v in rare_vals_list:
                data.loc[i,col] = 'Rare'
    return data

X_train = rare_replacement(X_train)
I also tried the version above, but I still can't get the expected result.
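For reference, this is a sketch of the behaviour I am after, written with `isin` so that it does not depend on the row labels. In my second attempt, `enumerate` yields positions 0, 1, 2, ... while '.loc' expects index labels, which may differ if X_train was produced by a split. The function name `replace_rare`, its parameters, and the demo data are hypothetical, and I am assuming the columns hold plain strings rather than a categorical dtype. Note also that my first attempt multiplies by 100 before comparing with 0.01, whereas this sketch compares the fraction directly:

```python
import pandas as pd

def replace_rare(data, columns, threshold=0.01, label="Rare"):
    """Return a copy of data where infrequent categories in `columns` become `label`."""
    data = data.copy()
    for col in columns:
        # Relative frequency of each category (fractions that sum to 1).
        freq = data[col].value_counts(normalize=True)
        rare_values = freq[freq < threshold].index
        # Row-aligned boolean mask, so .loc works whatever the index labels are.
        data.loc[data[col].isin(rare_values), col] = label
    return data

# Demo with a non-default index, as one might have after a train/test split.
demo = pd.DataFrame(
    {"Alley": ["Missing"] * 6 + ["Grvl"] * 3 + ["Pave"]},
    index=range(100, 110),
)
cleaned = replace_rare(demo, ["Alley"], threshold=0.2)
print(cleaned["Alley"].value_counts())
```

With my real data the call would presumably be `X_train = replace_rare(X_train, cat_others)`.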