I am trying to bin categorical columns programmatically - any ideas on how I can achieve this without manually hard-coding each value in that column?

Essentially, what I would like is a function that counts values up to 80% of the column [leaving those city names as is] and replaces the remaining 20% of city names with the word 'Other'.

I.e.: if the first 17 city names make up 80% of that column, keep those city names as is; otherwise return 'Other'.

E.g.:

0        Brighton
1        Yokohama
2           Levin
3       Melbourne
4     Coffeyville
5       Whakatane
6       Melbourne
7       Melbourne
8           Levin
9         Ashburn
10     Te Awamutu
11        Bishkek
12      Melbourne
13      Whanganui
14    Coffeyville
15       New York
16       Brisbane
17      Greymouth
18       Brisbane
19      Chuo City
20          Accra
21          Levin
22        Waiouru
23       Brisbane
24       New York
25      Chuo City
26        Lucerne
27      Whanganui
28    Los Angeles
29      Melbourne

df['city'].head(30).value_counts(ascending=False, normalize=True)*100

Melbourne      16.666667
Levin          10.000000
Brisbane       10.000000
Whanganui       6.666667
Coffeyville     6.666667
New York        6.666667
Chuo City       6.666667
Waiouru         3.333333
Greymouth       3.333333
Te Awamutu      3.333333
Bishkek         3.333333
Lucerne         3.333333
Ashburn         3.333333
Yokohama        3.333333
Whakatane       3.333333
Accra           3.333333
Brighton        3.333333
Los Angeles     3.333333

From Ashburn down, everything should be renamed to 'Other'.

I have tried the code below, which is a start, but not exactly what I want:

city_map = dict(df['city'].value_counts(ascending=False, normalize=True) * 100)
df['city_count'] = df['city'].map(city_map)

def count(row):
    if row["city_count"] > 10:
        return "High"
    elif row["city_count"] > 5:
        return "Medium"
    else:
        return "Low"

df.apply(count, axis=1)

I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this
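For reference, the cumulative-share behavior described above could be sketched roughly like this (a minimal sketch with made-up data; the `city_binned` column name and the 0.8 cutoff are illustrative only):

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({"city": ["Melbourne"] * 4 + ["Levin"] * 3
                           + ["Brisbane"] * 2 + ["Accra"]})

threshold = 0.8  # keep cities that together make up at most 80% of rows

# Cumulative share of rows per city, most frequent first
shares = df["city"].value_counts(normalize=True).cumsum()
keep = shares[shares <= threshold].index

# Keep a city name where it is in the "keep" set, otherwise write 'Other'
df["city_binned"] = df["city"].where(df["city"].isin(keep), "Other")
```

Here the cumulative shares are Melbourne 0.4, Levin 0.7, Brisbane 0.9, Accra 1.0, so only Melbourne and Levin fall within the 80% cutoff.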

AdrianC
  • Not sure I understand what you want. Could you add your expected result? Maybe this old question of mine can be of help? https://stackoverflow.com/questions/55727891/pandas-dataframe-how-to-aggregate-a-subset-of-rows-based-on-value-of-a-column – Valentino Jul 14 '19 at 11:54
  • Not quite. What I would like is [using the above table]: if cities sum to 65% of the values in that column (e.g. Auckland, Christchurch, Wellington - which equal about 65.6% of the data), then keep those values in those rows - all other cities are then renamed to 'Other' – AdrianC Jul 14 '19 at 12:04
  • How should we read the column `normalized`? Is that the column we have to take the sum of, or are those percentages? – Erfan Jul 14 '19 at 12:07
  • Normalized represents the count of that column as a percentage. There are over 5,000 cities so I didn't include them all. If the said function sums up the first 65% and returns those values as the city name, the remaining 35% should be named 'Other'. Is that even possible or am I asking for too much? – AdrianC Jul 14 '19 at 12:11
  • Sorry, I still don't get what you want. Just rename each city name to 'Other'? No grouping 'Other' into a single row? – Valentino Jul 14 '19 at 12:56
  • The idea was to not individually rename each value in the column. Some columns have over 7,000 unique values. For values that have counts of, say, less than 10, I just want to rename them to 'Other' – AdrianC Jul 14 '19 at 12:59
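The count-based variant mentioned in the last comment (rename values that occur fewer than some number of times) could be sketched like this, with made-up data and an illustrative `min_count` cutoff:

```python
import pandas as pd

# Made-up data: 'Melbourne' occurs 12 times, 'Levin' only 3
df = pd.DataFrame({"city": ["Melbourne"] * 12 + ["Levin"] * 3})

min_count = 10  # illustrative cutoff

# Per-row occurrence count of each city, looked up from value_counts
counts = df["city"].map(df["city"].value_counts())
df["city"] = df["city"].where(counts >= min_count, "Other")
```

Mapping the raw column through its own `value_counts` gives every row its city's count, so the comparison works row by row without any hard-coded names.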

1 Answer

We can group by city with groupby and get the size of each city. We divide those values by the length of the dataframe with len and calculate the cumulative sum with cumsum. The last step is to check at which point we exceed the threshold, so we can map the resulting boolean series back to your dataframe with map.

import numpy as np

threshold = 0.7
m = df['city'].map(
    df.groupby('city')['city'].size()
      .sort_values(ascending=False)
      .div(len(df))
      .cumsum()
      .le(threshold)
)

df['city'] = np.where(m, df['city'], 'Other')
          city
0         Other
1         Other
2         Levin
3     Melbourne
4   Coffeyville
5         Other
6     Melbourne
7     Melbourne
8         Levin
9       Ashburn
10        Other
11      Bishkek
12    Melbourne
13        Other
14  Coffeyville
15     New York
16     Brisbane
17        Other
18     Brisbane
19    Chuo City
20        Other
21        Levin
22        Other
23     Brisbane
24     New York
25    Chuo City
26        Other
27        Other
28        Other
29    Melbourne

Old method

If I understand you correctly, you want to calculate a cumulative sum with .cumsum and check when it exceeds your set threshold.

Then we use np.where to conditionally fill in the City name or Other.

threshold = 80

m = df['Normalized'].cumsum().le(threshold)

df['City'] = np.where(m, df['City'], 'Other')
            City  Normalized
0       Auckland   40.399513
1   Christchurch   13.130783
2     Wellington   12.267604
3       Hamilton    4.026242
4       Tauranga    3.867353
5      (not set)    3.540075
6        Dunedin    2.044508
7          Other    1.717975
8          Other    1.632849
9          Other    1.520342
10         Other    1.255651
11         Other    1.173878
12         Other    1.040508
13         Other    0.988166
14         Other    0.880502
15         Other    0.766877
16         Other    0.601468
17         Other    0.539067
18         Other    0.471824
19         Other    0.440903
20         Other    0.440344
21         Other    0.405884
22         Other    0.365836
23         Other    0.321131
24         Other    0.306602
25         Other    0.280524
26         Other    0.237123
27         Other    0.207878
28         Other    0.186084
29         Other    0.167085
30         Other    0.163732
31         Other    0.154977

Note: this method assumes that your Normalized column is sorted descending.
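If the Normalized column is not already sorted, sorting it descending first gives the same result; a sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up aggregated data, deliberately unsorted
df = pd.DataFrame({
    "City": ["Hamilton", "Auckland", "Wellington"],
    "Normalized": [10.0, 60.0, 30.0],
})

threshold = 80

# Sort descending so the cumulative sum runs in frequency order
df = df.sort_values("Normalized", ascending=False).reset_index(drop=True)

m = df["Normalized"].cumsum().le(threshold)
df["City"] = np.where(m, df["City"], "Other")
```

After sorting, the cumulative sums are 60, 90, 100, so only Auckland stays under the 80 threshold.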

Erfan
  • Just to expand may be - `temp = (df['City'].value_counts(ascending=False, normalize=True)*100) dict(zip(temp.index.values, np.where( temp.cumsum() > 80, 'Other', temp.index)))` – Vivek Kalyanarangan Jul 14 '19 at 12:18
  • I think `df` provided here is the `value_counts` output that needs to be used to map back to the original `df` – Vivek Kalyanarangan Jul 14 '19 at 12:18
  • Ye, kinda expected something like that. So I'm waiting for OP's response, @VivekKalyanarangan. I don't want to play the guessing game – Erfan Jul 14 '19 at 12:19
  • This is actually very close, but I received the following error: `ValueError: operands could not be broadcast together with shapes (1706,) (300303,) ()`. I understand the error - I've created a dict of the above normalized value counts and applied it to each row, but the cumulative part becomes redundant. I have 5,000 cities but 2,000,000 rows, so mapping the normalized count won't work - unless I'm missing something glaringly obvious – AdrianC Jul 14 '19 at 12:26
  • Then it's actually what I expected, you provided an aggregated data set (your `value_counts`) instead of your actual data. I need a sample of your actual data instead of the aggregated one. @AdrianC – Erfan Jul 14 '19 at 12:28
  • I have updated the original post - you're right - I should have been clearer with my request - apologies for this. Let me know if this helps – AdrianC Jul 14 '19 at 12:37
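The dict-mapping approach sketched in the comments above can be written out as a self-contained sketch (made-up data; because the dict is keyed by city name, it maps cleanly from the aggregated value_counts back onto the raw column, regardless of how many rows the raw data has):

```python
import numpy as np
import pandas as pd

# Made-up raw (unaggregated) data
df = pd.DataFrame({"City": ["Auckland"] * 6 + ["Wellington"] * 3 + ["Levin"]})

temp = df["City"].value_counts(ascending=False, normalize=True) * 100

# Keep a city while the running total is within 80%, otherwise map it to 'Other'
city_map = dict(zip(temp.index, np.where(temp.cumsum() > 80, "Other", temp.index)))
df["City"] = df["City"].map(city_map)
```

Here the percentages are Auckland 60, Wellington 30, Levin 10, with cumulative sums 60, 90, 100, so Wellington and Levin both land past the 80% mark and get mapped to 'Other'.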