3

I have:

df<-data.frame(group=c(1, 1, 1, 1, 2, 2, 2), 
               char=c("a", "b", "z", "a", "c", "c", "b"),
               want=c("a", "a", "a", "a", "c", "c", "c"))
df

  group char want
1     1    a    a
2     1    b    a
3     1    z    a
4     1    a    a
5     2    c    c
6     2    c    c
7     2    b    c

I want to return the value thats the most occuring within the char variable by group. If there's a way to address ties, (i.e. if there were two a's and two b's in group 1 to say so), that would also be ideal.

Thanks

biostatguy12
  • 649
  • 3
  • 8

4 Answers4

2

We can use the Mode function from here to get the most frequent value, do a group by 'group', and apply the function on the 'char' column

library(dplyr)
Mode <- function(x) {
   ux <- unique(x)
   tabl <- tabulate(match(x, ux))
   i1 <- tabl == max(tabl)
   stopifnot(sum(i1) == 1)
   toString(ux[i1])
   }

df %>%
   group_by(group) %>%
   mutate(want = Mode(char))
# A tibble: 7 x 3
# Groups:   group [2]
#  group char  want 
#  <dbl> <fct> <fct>
#1     1 a     a    
#2     1 b     a    
#3     1 z     a    
#4     1 a     a    
#5     2 c     c    
#6     2 c     c    
#7     2 b     c    

testing with ties

df$char[2] <- 'z'

df %>%
   group_by(group) %>%
   mutate(want = Mode(char))

Error: `mutate()` argument `want` errored.
ℹ `want` is `Mode(char)`.
ℹ The error occured in group 1: group = 1.
✖ sum(i1) == 1 is not TRUE
akrun
  • 874,273
  • 37
  • 540
  • 662
2

The function max.col has a method for dealing with ties, so you could make each group into a count matrix and use max.col

df<-data.frame(group=c(1, 1, 1, 1, 2, 2, 2), 
               char=c("a", "b", "z", "a", "c", "c", "b"),
               want=c("a", "a", "a", "a", "c", "c", "c"))

df$want1 <- ave(df$char, df$group, FUN = function(x) {
  x <- t(table(x))
  # print(x)
  colnames(x)[max.col(x, ties.method = 'first')]
})

df

#   group char want want1
# 1     1    a    a     a
# 2     1    b    a     a
# 3     1    z    a     a
# 4     1    a    a     a
# 5     2    c    c     c
# 6     2    c    c     c
# 7     2    b    c     c

For each unique group, you are finding the column index with the maximum value, then taking the column name:

## group - 1
     a b c z
[1,] 2 1 0 1

## group - 2
     a b c z
[1,] 0 1 2 0
rawr
  • 20,481
  • 4
  • 44
  • 78
1

One dplyr option could be:

df %>%
 add_count(group, char) %>%
 group_by(group) %>%
 mutate(want = toString(unique(char[n == max(n)]))) %>%
 select(-n)

  group char  want 
  <dbl> <fct> <chr>
1     1 a     a    
2     1 b     a    
3     1 z     a    
4     1 a     a    
5     2 c     c    
6     2 c     c    
7     2 b     c 

You can then check for ties:

df %>%
 add_count(group, char) %>%
 group_by(group) %>%
 mutate(want = toString(unique(char[n == max(n)])),
        ties = grepl(",", want, fixed = TRUE)) %>%
 select(-n)

  group char  want  ties 
  <dbl> <fct> <chr> <lgl>
1     1 a     a     FALSE
2     1 z     a     FALSE
3     1 b     a     FALSE
4     1 a     a     FALSE
5     2 c     c     FALSE
6     2 c     c     FALSE
7     2 b     c     FALSE
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
1

Here are 2 options using data.table:

1) without using by and carrying forward last observation for NA_character_ as nafill is not supporting character class for now.

library(data.table)
setDT(df)[order(group, -rowid(group, char)), 
    w1 := char[nafill(replace(.I, 1L + which(diff(group) == 0L), NA_integer_), "locf")]
]

This one should be faster if you have a large dataset.

2) Or equivalently, using by to make it more readable:

setDT(df)[order(group, -rowid(group, char)), w2 := char[1L], group]

output:

   group char want w
1:     1    a    a a
2:     1    b    a a
3:     1    z    a a
4:     1    a    a a
5:     2    c    c c
6:     2    c    c c
7:     2    b    c c
chinsoon12
  • 25,005
  • 4
  • 25
  • 35