Replacing rarely occurring values in a pandas DataFrame (by percentage)

Question

While looking through a similar question related to my problem, I found a solution, but I needed to change the condition as below:

def clean_variables(data):
    for column in cat_others:
        data.loc[(data[column].value_counts()/data[column].count())*100 < 0.01, cat_others] = "Rare"
     
        return data

X_train = clean_variables(X_train)

However, I am getting an indexing error, shown below:

(Missing    False
Grvl       False
Pave       False
Name: Alley, dtype: bool, ['Alley', 'RoofStyle', 'LandContour', 'HouseStyle', 'BldgType'])

I tried '.iloc' instead of '.loc', but I still cannot get it to work. The list of column names shown in the output is the list of cat_others column names. The operation should be applied to those columns, but there is a problem with the loop condition.

Thanks for your help.

def rare_replacement(data):
    for col in cat_others:
        v_counts = data[col].value_counts()
        rare_vals = v_counts[(v_counts/v_counts.sum()) < 0.01]
        rare_vals_list = list(rare_vals.index)
    for i,v in enumerate(data[col]):
        if v in rare_vals_list:
            data.loc[i,col] = 'Rare'
        
        return data

X_train = rare_replacement(X_train)

I also tried this one, but I still can't get the expected result.

Overall it looks like it should work. I'm not sure why you're passing cat_others to the .loc though? Also, you're going to want to take the return statement out of the for loop. It will currently return after finishing the first column. - rayad, Jun 07 '22 at 13:54
As for passing cat_others to .loc, I wanted to replace the values in the cat_others columns that meet the condition. But if what you mean is "you're already checking the values in cat_others, so no other column can match the condition and there is no need for it", that is logical and I had not thought it through. However, I could not understand what you said about the return statement. Thanks, by the way. - Burchette, Jun 12 '22 at 09:00
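The two points raised in the comment above can be illustrated with a minimal sketch; it is one possible approach, not a definitive fix. value_counts() returns a Series indexed by the category values (Missing, Grvl, Pave in the error output), so a mask built from it does not line up with the rows of the DataFrame. The sketch maps the rare categories back to rows with isin(), passes only the current column to .loc, and returns after the loop has finished. The function name replace_rare, the threshold and label parameters, and the toy data are hypothetical and only serve the illustration. Note that `* 100 < 0.01` in the first snippet means 0.01%, while `< 0.01` in the second means 1%; the sketch assumes 1%.

```python
import pandas as pd


def replace_rare(data, columns, threshold=0.01, label="Rare"):
    """Replace categories whose relative frequency is below `threshold`.

    `threshold` is a fraction (0.01 means 1%); this is an assumption.
    """
    data = data.copy()  # leave the caller's DataFrame untouched
    for column in columns:
        # Share of each category among non-missing values, indexed by category.
        freqs = data[column].value_counts(normalize=True)
        rare_values = freqs[freqs < threshold].index
        # Row-aligned boolean mask: True where the row holds a rare category.
        mask = data[column].isin(rare_values)
        # Only the current column is passed to .loc, not the whole list.
        data.loc[mask, column] = label
    # The return sits outside the loop, so every column is processed first.
    return data


# Hypothetical toy data: 200 rows, so a single occurrence is 0.5% (< 1%).
X_train = pd.DataFrame({
    "Alley": ["Missing"] * 195 + ["Grvl"] * 4 + ["Pave"],
    "RoofStyle": ["Gable"] * 150 + ["Hip"] * 49 + ["Shed"],
})
cat_others = ["Alley", "RoofStyle"]

X_clean = replace_rare(X_train, cat_others)
print(X_clean["Alley"].value_counts())
print(X_clean["RoofStyle"].value_counts())
```

With the return inside the loop, as in both snippets in the question, the function exits during the first iteration, so at most one column (or, in the second snippet, one row) is ever handled. The second snippet also relies on data.loc[i, col] with positions from enumerate, which only matches the rows when the index is the default 0..n-1 range; the mask-based assignment above does not depend on that.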
