How to create a new column based on the mode of some other columns in a data frame in R

Question

I have a data frame like this:

id	w1	w2	w3	w4	w5	w6
11	light	light	light	light	light	light
22	light	light	light	light	medium	medium
33	light	light	medium	medium	medium	heavy
44	light	light	medium	NA	NA	NA
55	light	light	medium	medium	NA	NA
66	medium	medium	medium	NA	NA	NA

I would like to get the frequency count of light, medium, heavy for each id across w1-w6. And I would to get the mode of w1-w6 as a new column.

The target df should look like this:

id	w1	w2	w3	w4	w5	w6	N_light	N_medium	N_heavy	final
11	light	light	light	light	light	light	6	0	0	light
22	light	light	light	light	medium	medium	4	2	0	light
33	light	light	medium	medium	medium	heavy	2	3	1	medium
44	light	light	medium	NA	NA	NA	2	1	0	light
55	light	light	medium	medium	NA	NA	2	2	0	light
66	medium	medium	medium	NA	NA	NA	0	3	0	medium

The real data frame has millions of rows. I struggle to find an efficient way to do this. Any ideas?

I tried the Mode function from DescTools library, that worked with a limited number of rows in a for loop. But it is too slow to run.

Welcome to Stack Overflow! Can you please read and incorporate elements from [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1082435). Especially the aspects of using `dput()` for the input and then an explicit example of your expected dataset? — wibeasley, Jan 26 '23 at 16:56
Is the [modeest](https://CRAN.R-project.org/package=modeest) package any faster? — wibeasley, Jan 26 '23 at 16:57
As a rule, loops are slower than say, apply (or using functions in the purrr library). Another option is `dtplyr`: https://www.business-science.io/code-tools/2019/08/15/big-data-dtplyr.html — Pss, Jan 26 '23 at 17:07

Chris Ruehlemann · Answer 1 · 2023-01-27T10:04:59.790

Here's a tidyverse solution:

df %>%
  #cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of ...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>% 
  # remove duplicated rows:
  slice_head() %>% 
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA` with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df%>% select(-id), .) %>%
  # reorder columns:
  select(id, everything())
  id     w1     w2     w3     w4 N_light N_medium N_heavy
1  1  light  light  light  light       4        0       0
2  2  light  light  light  light       4        0       0
3  3  light  light medium medium       2        2       0
4  4  light  light   <NA> medium       2        1       0
5  5  light  light medium medium       2        2       0
6  6 medium medium   <NA>  heavy       0        2       1

EDIT:

If the ultimate goal is to compute the mode for the three new columns, then this may be a way to go:

# First define a function for the mode:

getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(table(match(v, uniqv)))]
}

# Second, do as before:

df %>%
  #cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of ...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>% 
  # remove duplicated rows:
  slice_head() %>% 
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA`with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df%>% select(-id), .) %>%
  select(id, everything()) %>%

  # Third, add to this the computation of the mode:
  
  # compute mode:
  summarise(across(starts_with("N"), ~getmode(.)))
  N_light N_medium N_heavy
1       2        2       0

Data:

df <- structure(list(id = 1:6, w1 = c("light", "light", "light", "light", 
                                      "light", "medium"), w2 = c("light", "light", "light", "light", 
                                                                 "light", "medium"), w3 = c("light", "light", "medium", NA, "medium", 
                                                                                            NA), w4 = c("light", "light", "medium", "medium", "medium", "heavy"
                                                                                            )), class = "data.frame", row.names = c(NA, -6L))

score 0 · Answer 2 · answered Jan 26 '23 at 18:16

in Base R you could do:

a <- table(cbind(dat[1], stack(dat, -1))[1:2])
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a)])

   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2 medium
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium

jpsmith · Accepted Answer · 2023-01-26T18:42:42.173

I know this asks for dplyr, but if other find base R useful you could simply index and use *apply functions

xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
 # or xx <- c("light", "medium", "heavy")
newnames <- paste0("N_",xx)

df[newnames] <- sapply(xx, 
                       function(x) rowSums(df[,-1] == x, 
                                           na.rm = TRUE))
df["final"] <- xx[apply(df[newnames], 1, which.max)]

Output:

  id     w1     w2     w3     w4     w5     w6 N_light N_medium N_heavy  final
1 11  light  light  light  light  light  light       6        0       0  light
2 22  light  light  light  light medium medium       4        2       0  light
3 33  light  light medium medium medium  heavy       2        3       1 medium
4 44  light  light medium   <NA>   <NA>   <NA>       2        1       0  light
5 55  light  light medium medium   <NA>   <NA>       2        2       0  light
6 66 medium medium medium   <NA>   <NA>   <NA>       0        3       0 medium

How to create a new column based on the mode of some other columns in a data frame in R

3 Answers3