0

I have the following dataset which I'm trying to trim (sample of the dataframe is below)

Index Grade
Ace_1_1 A
Ace_1_1 A
Ace_1_1 B
Ace_1_1 C
Ace_1_2 A
Ace_1_2 C
Ace_1_2 C
Ace_1_3 B
Ace_1_3 B
Ace_2_2 C
Ace_2_2 A
Ace_2_2 B
Ace_2_5 C
Ace_2_5 C
Ace_2_5 A
Ace_2_5 A
Ace_3_1 D
df2 = structure(list(Index = c("Ace_1_1", "Ace_1_1", "Ace_1_1", "Ace_1_1", 
                               "Ace_1_2", "Ace_1_2", "Ace_1_2", "Ace_1_3", "Ace_1_3", "Ace_2_2", 
                               "Ace_2_2", "Ace_2_2", "Ace_2_5", "Ace_2_5", "Ace_2_5", "Ace_2_5", 
                               "Ace_3_1"), Grade = c("A", "A", "B", "C", "A", "C", "C", "B", 
                                "B", "C", "A", "B", "C", "C", "A", "A", "D")), class = "data.frame", row.names = c(NA,-17L))

I'm trying to trim it down such that for each index, only the most common grade would show. If there is a tie in frequency, the tied grades would be shown. If there's only one entry the entry would be show as well. The ideal output for the table above would be

Index Grade
Ace_1_1 A
Ace_1_2 C
Ace_1_3 B
Ace_2_2 C
Ace_2_2 A
Ace_2_2 B
Ace_2_5 C
Ace_2_5 A
Ace_3_1 D

I'm using the group_by() function by counting the frequency with mutate (Frequency = n()) but I'm having trouble subsetting the most frequent count. Appreciate if anyone could help me count here either using dplyr or even base R, thanks!

Luther_Proton
  • 348
  • 1
  • 7

2 Answers2

2

You may calculate the statistical mode for each Index and keep rows that only occur most frequently. You can then use distinct to keep only the unique ones.

library(dplyr)

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

df2 %>%
  filter(Grade %in% Modes(Grade), .by = Index) %>%
  distinct(Index, Grade)

#    Index Grade
#1 Ace_1_1     A
#2 Ace_1_2     C
#3 Ace_1_3     B
#4 Ace_2_2     C
#5 Ace_2_2     A
#6 Ace_2_2     B
#7 Ace_2_5     C
#8 Ace_2_5     A
#9 Ace_3_1     D
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

Using data.table or dplyr, you can simply take the rows in each group where the count equals the max count:

library(dplyr)
count(df2,Index,Grade) %>% filter(n==max(n), .by=Index) %>% select(-n)

OR

library(data.table)
setDT(df2)[, .N, .(Index, Grade)][, .SD[N==max(N), Grade], Index]

Output:

     Index  Grade
    <char> <char>
1: Ace_1_1      A
2: Ace_1_2      C
3: Ace_1_3      B
4: Ace_2_2      A
5: Ace_2_2      B
6: Ace_2_2      C
7: Ace_2_5      A
8: Ace_2_5      C
9: Ace_3_1      D
langtang
  • 22,248
  • 1
  • 12
  • 27