1

I am trying to create a new variable that is based on the model number of observations of a separate variable.

Using this df:

help <- data.frame(
id = c(rep(05, times = 8), rep(10, times = 8), rep(12, times = 8)),
episode = c(rep(1, times = 4), rep(2, times =4), rep(3, times = 8), rep(1, times = 4), rep(2, times =4)),
provider = c(rep(70, times = 2), rep(80, times = 2), rep(70, times = 4), rep(30, times = 6), rep(40, times = 2), rep(70, times = 4), rep(10, times = 4)))

I am hoping to create a new variable, provider_mode that is based on the modal provider, or provider with the most observations, per episode.

The end df would look like this:

id episode provider provider_mode
5       1       70       70
5       1       70       70
5       1       80       70
5       1       80       70
5       2       70       70 
5       2       70       70
5       2       70       70
5       2       70       70
10      3       30       30
10      3       30       30
10      3       30       30
10      3       30       30
10      3       30       30
10      3       30       30
10      3       40       30
10      3       40       30
12      1       70       70
12      1       70       70
12      1       70       70
12      1       70       70
12      2       10       10
12      2       10       10
12      2       10       10
12      2       10       10

Here is the code I came up with thus far, but it only gives me the count for each provider within each episode. I need to create a mutate command that puts the provider with the most observations, and if there is a tie, select the first provider (e.g., provider 70 within id 5).

help %>% group_by(id, episode, provider) %>% mutate(provider_count = n()) 
b222
  • 966
  • 1
  • 9
  • 19

3 Answers3

2

The problem is R doesn't really have a mode() function that calculates the statistical mode. We can fake it with dplyr() by calculating the counts and then merging them back in. For example if your data is stored in a variable named dd,

dd %>% 
    group_by(id, episode) %>% 
    count(provider) %>% 
    top_n(1, n) %>% 
    filter(row_number()==1) %>%
    select(-n) %>% 
    rename(provider_mode=provider) %>% 
    right_join(dd)
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • I didn't realize `top_n` would return multiple rows in case of ties. I've added an additional filter to return just one value per id. – MrFlick May 17 '17 at 02:41
  • this works as long as I due `group_by(id, episode)` ... if I don't' group by episode as well, the provider for id 12 stays the same across both episodes. thanks, flick. – b222 May 17 '17 at 02:46
2

With dplyr we can group by id and episode and using table we find the frequency of each provider and select the maximum of it with which.max.

library(dplyr)
help %>%
  group_by(id, episode) %>%
  mutate(provider_mode = names(which.max(table(provider))))

#     id  episode provider provider_mode
#   <dbl>   <dbl>    <dbl>         <chr>
#1      5       1       70            70
#2      5       1       70            70
#3      5       1       80            70
#4      5       1       80            70
#5      5       2       70            70
#6      5       2       70            70
#7      5       2       70            70
#8      5       2       70            70
#9     10       3       30            30
#10    10       3       30            30
# ... with 14 more rows

And the similar could be achieved with ave from base R

help$provider_mode <- ave(help$provider, help$id, help$episode, FUN = function(x) 
                                      names(which.max(table(x)))) 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

We can create a Mode function as showed here

Mode <- function(x) {
 ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]

}

and then with data.table, grouped by 'id', 'episode', we get the 'Mode' of 'provider'

library(data.table)
setDT(help)[, provider_mode := Mode(provider), by = .(id, episode)]
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662