
Define:

df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3))
)

so that:

> df1
  id v1
1  1  a
2  1  b
3  1  b
4  2  c
5  2  c
6  2  c

I want to create a third variable, freq, that contains the most frequent value of v1 within each id, so that:

> df2
  id v1 freq
1  1  a    b
2  1  b    b
3  1  b    b
4  2  c    c
5  2  c    c
6  2  c    c
Fred

3 Answers


Another way consists of using tidyverse functions:

  • grouping by both variables with group_by(), then counting how often each v1 value occurs within each id using tally()
  • arranging by the number of occurrences with arrange()
  • summarizing and picking out the first row with summarize() and first()

Therefore:

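library(dplyr)

df1 %>%
  group_by(id, v1) %>%          # one group per id/v1 combination
  tally() %>%                   # n = number of rows in each group
  arrange(id, desc(n)) %>%      # most frequent v1 first within each id
  summarize(freq = first(v1))   # keep the top v1 per id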

This will give you just the mapping (which I find cleaner):

# A tibble: 2 x 2
     id   freq
  <dbl> <fctr>
1     1      b
2     2      c

You can then left_join your original data frame with that table.
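
The join step might look like the sketch below, assuming dplyr is installed. The names freq_map and df2 are chosen here for illustration, and df1 is rebuilt so the example is self-contained.

```r
library(dplyr)

# Same data as in the question, with v1 kept as character
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3)),
  stringsAsFactors = FALSE
)

# Mapping from each id to its most frequent v1 (freq_map is an illustrative name)
freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

# Attach the mapping to every row of the original data frame
df2 <- left_join(df1, freq_map, by = "id")
```

Keeping the mapping as a separate table makes it easy to inspect the counts before joining.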

slhck
  • I like that approach because one can check for and identify ties after `tally()`. That might be possible with @joran's great function too, but not as straightforwardly as here, at least for me – tjebo Mar 08 '18 at 09:55

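You can do this using ddply() from the plyr package and a custom function that picks out the most frequent value:

library(plyr)

# For one id's subset, add a freq column holding the most common v1 value
myFun <- function(x){
    tbl <- table(x$v1)
    x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
    x
}

ddply(df1,.(id),.fun=myFun)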

Note that in the case of ties, which.max returns the first occurrence of the maximum value. See ?which.is.max in the nnet package for an option that breaks ties randomly.
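
As a small sketch of what "first occurrence" means here, consider a hypothetical tied vector x (not part of the question's data). table() orders its names alphabetically, so the tie goes to the first name in that sorted order rather than to the value seen first in the data. The last line shows one way to list every tied value.

```r
# Hypothetical example vector with a tie between "a" and "b"
x <- c("b", "a", "a", "b")
tbl <- table(x)

# which.max() picks the first maximum in table order (names sorted: "a", "b")
first_mode <- names(tbl)[which.max(tbl)]

# All values that share the maximum count
all_modes <- names(tbl)[as.vector(tbl) == max(tbl)]
```

Here first_mode is "a" even though "b" appears first in x, and all_modes contains both tied values.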

joran
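
A base R option is ave() with a small helper that returns the most frequent value:

# Named stat_mode to avoid masking base::mode(); the table is computed once
stat_mode <- function(x) {
  tbl <- table(x)
  names(tbl)[which.max(tbl)]
}
df1$freq <- ave(df1$v1, df1$id, FUN = stat_mode)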
> df1
  id v1 freq
1  1  a    b
2  1  b    b
3  1  b    b
4  2  c    c
5  2  c    c
6  2  c    c
IRTFM
  • I think `df2` is a typo, and when I run this I get `NA`s for `id`=2. – joran Jun 28 '11 at 22:13
  • The typo is gone, but I still don't think this code works. When id=2, max(table(x)) returns 3, but table(x) has only 1 name, so your function mode returns NA. – joran Jun 28 '11 at 23:13
  • It is accidentally giving the correct result, because of an accident of factors. df$id is a factor and the 3rd level is "c". Fixed. – IRTFM Jun 29 '11 at 01:20