I am trying bin categorical columns programtically - any idea on how I can achieve this without manually hard-coding each value in that column
Essentially, what I would like is a function whereby it counts all values up to 80% [leaves the city name as is] and replaces the remaining 20% of city names with the word 'Other'
IE: if the first 17 city names make up 80% of that column, keep the city name as is, else return 'other'.
EG:
0 Brighton
1 Yokohama
2 Levin
3 Melbourne
4 Coffeyville
5 Whakatane
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Te Awamutu
11 Bishkek
12 Melbourne
13 Whanganui
14 Coffeyville
15 New York
16 Brisbane
17 Greymouth
18 Brisbane
19 Chuo City
20 Accra
21 Levin
22 Waiouru
23 Brisbane
24 New York
25 Chuo City
26 Lucerne
27 Whanganui
28 Los Angeles
29 Melbourne
df['city'].head(30).value_counts(ascending=False, normalize=True)*100
Melbourne 16.666667
Levin 10.000000
Brisbane 10.000000
Whanganui 6.666667
Coffeyville 6.666667
New York 6.666667
Chuo City 6.666667
Waiouru 3.333333
Greymouth 3.333333
Te Awamutu 3.333333
Bishkek 3.333333
Lucerne 3.333333
Ashburn 3.333333
Yokohama 3.333333
Whakatane 3.333333
Accra 3.333333
Brighton 3.333333
Los Angeles 3.333333
From Ashburn down - it should be renamed to 'other'
I have tried the below which is a start, but not exactly what I want:
city_map = dict(df['city'].value_counts(ascending=False, normalize=True)*100)
df['city_count']= df['city'].map(city_map)
def count(df):
if df["city_count"] > 10:
return "High"
elif df["city_count"] < 0:
return "Medium"
else:
return "Low"
df.apply(count, axis=1)
I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this