Rare Label Encoding by Proportion in R

Question

df

n = c(2, 3, 5, 8, 10, 12) 
s = c("aa", "bb", "cc", "aa", "bb","aa") 
b = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE) 
df = data.frame(n, s, b)

I want to calculate the proportion of each category in "s" and then replace a category with "rare" if its proportion is below 20%.

Desired result:

    n   s       b
1   2   aa      TRUE
2   3   bb      FALSE
3   5   rare    TRUE
4   8   aa      FALSE
5   10  bb      TRUE
6   12  aa      FALSE

I've been able to find how to calculate a proportion but not how to use that proportion to replace a categorical value.

mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

score 0 · Accepted Answer · answered Aug 12 '20 at 13:53

You can calculate the proportions with table() and prop.table(), and then replace the values of s whose proportion is less than 0.2 with 'rare'.

df$s[df$s %in% names(Filter(function(x) x < 0.2, 
         prop.table(table(df$s))))] <- 'rare'

df
#   n    s     b
#1  2   aa  TRUE
#2  3   bb FALSE
#3  5 rare  TRUE
#4  8   aa FALSE
#5 10   bb  TRUE
#6 12   aa FALSE
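
To see what the one-liner does, here is a sketch that splits it into named steps. The names props and rare_levels are illustrative only, and the data is rebuilt with stringsAsFactors = FALSE on the assumption that s should be a character column. In R versions before 4.0, data.frame() turned strings into factors by default, and assigning the new value 'rare' to a factor gives NA with a warning.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps s as character
df <- data.frame(
  n = c(2, 3, 5, 8, 10, 12),
  s = c("aa", "bb", "cc", "aa", "bb", "aa"),
  b = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Share of rows per category: aa = 3/6, bb = 2/6, cc = 1/6
props <- prop.table(table(df$s))

# Names of the categories that fall below the 20% threshold
rare_levels <- names(props)[props < 0.2]

# Overwrite only the rows that belong to those categories
df$s[df$s %in% rare_levels] <- "rare"
```

Only cc (1 of 6 rows) falls below 0.2 here, so only row 3 is relabelled.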

score 0 · Answer 2 · answered Aug 12 '20 at 13:57

You can also try this with dplyr:

library(dplyr)
#Data
n = c(2, 3, 5, 8, 10, 12) 
s = c("aa", "bb", "cc", "aa", "bb","aa") 
b = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE) 
df = data.frame(n, s, b, stringsAsFactors = FALSE)
#Mutate
df %>%
  group_by(s) %>%
  mutate(I = n() / nrow(df), s = ifelse(I < 0.20, 'rare', s)) %>%
  select(-I)

Output:

# A tibble: 6 x 3
# Groups:   s [3]
      n s     b    
  <dbl> <chr> <lgl>
1     2 aa    TRUE 
2     3 bb    FALSE
3     5 rare  TRUE 
4     8 aa    FALSE
5    10 bb    TRUE 
6    12 aa    FALSE
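
The result above is still grouped by s. If that is not wanted downstream, one possible variant (a sketch, not part of the answer) uses add_count() to attach the per-category count without leaving a grouping behind. The column name cnt is an arbitrary choice made here to avoid clashing with the existing column n.

```r
library(dplyr)

df <- data.frame(
  n = c(2, 3, 5, 8, 10, 12),
  s = c("aa", "bb", "cc", "aa", "bb", "aa"),
  b = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

result <- df %>%
  add_count(s, name = "cnt") %>%                          # rows per category, stored in cnt
  mutate(s = if_else(cnt / n() < 0.20, "rare", s)) %>%    # n() is the total row count here
  select(-cnt)                                            # drop the helper column
```

Because nothing is grouped, there is no need for ungroup() afterwards.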

score 0 · Answer 3 · answered Aug 12 '20 at 14:06

You can do it without dplyr, using only base R functions. First, create a table of frequencies, then filter it to drop all entries with a relative frequency below 0.2, and finally use the remaining names to reset the values of column s:

f = c(table(df$s))            # counts per category
f = f[f / sum(f) >= 0.2]      # keep categories at or above 20%
df$s[!df$s %in% names(f)] = "rare"
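
If the same replacement is needed for several columns or thresholds, the steps can be wrapped in a small function. This is a minimal sketch: encode_rare and its parameters threshold and label are hypothetical names introduced here, not something from the answers or from any package.

```r
# Hypothetical helper: replaces every category whose share of the
# non-missing values is below `threshold` with `label`.
encode_rare <- function(x, threshold = 0.2, label = "rare") {
  x <- as.character(x)                       # also accepts factor input
  props <- prop.table(table(x))              # proportion per category (NA is ignored)
  keep <- names(props)[props >= threshold]   # categories that are common enough
  x[!is.na(x) & !x %in% keep] <- label       # relabel the rest, leave NA untouched
  x
}

s <- c("aa", "bb", "cc", "aa", "bb", "aa")
encoded <- encode_rare(s)
```

For a data frame this would be used as df$s <- encode_rare(df$s). Converting to character first sidesteps the factor-level problem, at the cost of returning a character vector even for factor input.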
