1

I have a data frame like this:

id w1 w2 w3 w4 w5 w6
11 light light light light light light
22 light light light light medium medium
33 light light medium medium medium heavy
44 light light medium NA NA NA
55 light light medium medium NA NA
66 medium medium medium NA NA NA

I would like to get the frequency count of light, medium, heavy for each id across w1-w6. And I would to get the mode of w1-w6 as a new column.

The target df should look like this:

id w1 w2 w3 w4 w5 w6 N_light N_medium N_heavy final
11 light light light light light light 6 0 0 light
22 light light light light medium medium 4 2 0 light
33 light light medium medium medium heavy 2 3 1 medium
44 light light medium NA NA NA 2 1 0 light
55 light light medium medium NA NA 2 2 0 light
66 medium medium medium NA NA NA 0 3 0 medium

The real data frame has millions of rows. I struggle to find an efficient way to do this. Any ideas?

I tried the Mode function from DescTools library, that worked with a limited number of rows in a for loop. But it is too slow to run.

Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
New R user
  • 13
  • 3
  • Welcome to Stack Overflow! Can you please read and incorporate elements from [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1082435). Especially the aspects of using `dput()` for the input and then an explicit example of your expected dataset? – wibeasley Jan 26 '23 at 16:56
  • Is the [modeest](https://CRAN.R-project.org/package=modeest) package any faster? – wibeasley Jan 26 '23 at 16:57
  • As a rule, loops are slower than say, apply (or using functions in the purrr library). Another option is `dtplyr`: https://www.business-science.io/code-tools/2019/08/15/big-data-dtplyr.html – Pss Jan 26 '23 at 17:07
  • I've added a solution for how to get the mode – Chris Ruehlemann Jan 27 '23 at 10:00

3 Answers3

0

Here's a tidyverse solution:

df %>%
  #cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of ...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>% 
  # remove duplicated rows:
  slice_head() %>% 
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA` with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df%>% select(-id), .) %>%
  # reorder columns:
  select(id, everything())
  id     w1     w2     w3     w4 N_light N_medium N_heavy
1  1  light  light  light  light       4        0       0
2  2  light  light  light  light       4        0       0
3  3  light  light medium medium       2        2       0
4  4  light  light   <NA> medium       2        1       0
5  5  light  light medium medium       2        2       0
6  6 medium medium   <NA>  heavy       0        2       1

EDIT:

If the ultimate goal is to compute the mode for the three new columns, then this may be a way to go:

# First define a function for the mode:

getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(table(match(v, uniqv)))]
}

# Second, do as before:

df %>%
  #cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of ...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>% 
  # remove duplicated rows:
  slice_head() %>% 
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>% 
  # replace `NA`with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df%>% select(-id), .) %>%
  select(id, everything()) %>%

  # Third, add to this the computation of the mode:
  
  # compute mode:
  summarise(across(starts_with("N"), ~getmode(.)))
  N_light N_medium N_heavy
1       2        2       0

Data:

df <- structure(list(id = 1:6, w1 = c("light", "light", "light", "light", 
                                      "light", "medium"), w2 = c("light", "light", "light", "light", 
                                                                 "light", "medium"), w3 = c("light", "light", "medium", NA, "medium", 
                                                                                            NA), w4 = c("light", "light", "medium", "medium", "medium", "heavy"
                                                                                            )), class = "data.frame", row.names = c(NA, -6L))
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
0

in Base R you could do:

a <- table(cbind(dat[1], stack(dat, -1))[1:2])
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a)])

   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2 medium
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

I know this asks for dplyr, but if other find base R useful you could simply index and use *apply functions

xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
 # or xx <- c("light", "medium", "heavy")
newnames <- paste0("N_",xx)

df[newnames] <- sapply(xx, 
                       function(x) rowSums(df[,-1] == x, 
                                           na.rm = TRUE))
df["final"] <- xx[apply(df[newnames], 1, which.max)]

Output:

  id     w1     w2     w3     w4     w5     w6 N_light N_medium N_heavy  final
1 11  light  light  light  light  light  light       6        0       0  light
2 22  light  light  light  light medium medium       4        2       0  light
3 33  light  light medium medium medium  heavy       2        3       1 medium
4 44  light  light medium   <NA>   <NA>   <NA>       2        1       0  light
5 55  light  light medium medium   <NA>   <NA>       2        2       0  light
6 66 medium medium medium   <NA>   <NA>   <NA>       0        3       0 medium
jpsmith
  • 11,023
  • 5
  • 15
  • 36