Identifying which variable occurs most in group with multiple columns

Question

I'm still new to R and could use some help. I have a dataset that looks something like this:

a <- c("a", "b", "c", "d", "a", "d") 
E <- c(NA, "E", NA, "E", NA, "E")
F <- c(NA, "F", "F", "F", NA, NA)
G <- c("G", NA, "G", "G", "G", NA)

df <- data.frame(a, E, F, G)

I'm trying to find out which of E, F, or G occurs most often within each group when I group by `a`. My biggest issue seems to be that the values are characters spread across three separate columns. I tried combining them into one column, but it didn't work. I've been searching for hours without finding an answer and am now just confused by what I would think should be an easy question. Any help would be amazing. Thanks!

Edit: Sorry, I'm very new to the site and am still getting the formatting down. The correct output would ideally be something like this:

  a   Mostcommon
  -   ----------
1  a     "G"
2  b    "E""F"
3  c    "F""G"
4  d     "E"

That's using the example I gave. With my actual data there should only be one most common value per group.

Are these all in a data frame, something like `df <- data.frame(a, E, F, G)`? And are your `NA` values missing values (without quotes, `NA`) or strings with quotes `"NA"`? Could you show the expected output for this sample input? — Gregor Thomas, May 03 '22 at 14:59
So what exactly is the correct output for this input? Are these supposed to be columns in a data.frame or are they truly separate vectors? — MrFlick, May 03 '22 at 15:00

score 0 · Accepted Answer · edited May 03 '22 at 15:51


Is this what you'd like to do?

library(tidyverse)

tibble(
  a = c("a", "b", "c", "d", "a", "d"),
  E = c(NA, "E", NA, "E", NA, "E"),
  F = c(NA, "F", "F", "F", NA, NA),
  G = c("G", NA, "G", "G", "G", NA)
) |> 
  # 1 where a value is present, 0 where it is missing
  mutate(across(E:G, ~ if_else(is.na(.x), 0, 1))) |> 
  group_by(a) |> 
  # number of non-missing values per column within each group
  summarise(across(E:G, sum))
#> # A tibble: 4 × 4
#>   a         E     F     G
#>   <chr> <dbl> <dbl> <dbl>
#> 1 a         0     0     2
#> 2 b         1     1     0
#> 3 c         0     1     1
#> 4 d         2     1     1

^{Created on 2022-05-03 by the reprex package (v2.0.1)}
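The counts above still need one more step to name the most common column in each group. One possible way, sketched here and not part of the original answer, is to reshape the counts to long form and keep the row(s) with the largest count per group. The names `counts`, `variable`, `n`, and `most_common` are made up for illustration, and ties are kept and joined with a comma.

```r
library(dplyr)
library(tidyr)

# Per-group counts, typed in by hand to match the summarise() output above
counts <- tibble(
  a = c("a", "b", "c", "d"),
  E = c(0, 1, 0, 2),
  F = c(0, 1, 1, 1),
  G = c(2, 0, 1, 1)
)

most_common <- counts |>
  # one row per (group, column) pair
  pivot_longer(-a, names_to = "variable", values_to = "n") |>
  group_by(a) |>
  # keep every column that reaches the group's maximum, so ties survive
  filter(n == max(n)) |>
  summarise(most_common = toString(variable))

most_common
```

With real data where each group has a single most common value, the `toString()` call simply returns that one column name.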

edited May 03 '22 at 15:51

Gregor Thomas


answered May 03 '22 at 15:13

Carl


I got this to work! I had to replace my NA values with 0 character values but then it worked great. I can totally make this work for what I need. Thank you! – Clara W May 03 '22 at 15:38
Changed `== "NA"` to `is.na()` now that the question has been updated. – Gregor Thomas May 03 '22 at 15:51

score 0 · Answer 2 · answered May 03 '22 at 15:44

You could use the `Modes` function defined here. I've copied it below:

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
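As a quick check of how `Modes` behaves, the sketch below calls it on a few small vectors. The inputs are made up for illustration. Note that `NA` is treated as a value like any other, which is presumably why the pipeline below drops missing values before calling it.

```r
# Modes() as defined above, repeated so this snippet runs on its own
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

Modes(c("G", "G"))            # single mode: "G"
Modes(c("E", "F"))            # tie: both "E" and "F" are returned
Modes(c("E", "F", "G", "E"))  # "E" occurs twice, so only "E" is returned
Modes(c(NA, NA, "E"))         # NA is counted too, so the result is NA
```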

Now, with the `Modes` function, do the following:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-a, values_drop_na = TRUE) %>%
  group_by(a) %>%
  summarize(most_common = toString(Modes(value)))

# A tibble: 4 x 2
  a     most_common
  <chr> <chr>      
1 a     G          
2 b     E, F       
3 c     F, G       
4 d     E
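For anyone who prefers to avoid the tidyverse, the same result can probably be reached in base R. This is only a sketch under the assumption that the data look like the question's `df`; it is not taken from either answer, and the names `counts` and `most_common` are illustrative.

```r
df <- data.frame(
  a = c("a", "b", "c", "d", "a", "d"),
  E = c(NA, "E", NA, "E", NA, "E"),
  F = c(NA, "F", "F", "F", NA, NA),
  G = c("G", NA, "G", "G", "G", NA)
)

# Count the non-missing entries of each column within each group of `a`
counts <- rowsum(+!is.na(df[c("E", "F", "G")]), group = df$a)

# For each group, keep the column name(s) that reach the highest count
most_common <- apply(counts, 1, function(n) toString(names(n)[n == max(n)]))

most_common
```

Here `rowsum()` sums the 0/1 indicator matrix by group, and `apply()` then picks the column names with the maximum count in each row, joining ties with a comma.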

I tried this and it also worked, gave me the same answer as the above method. Thank you! — Clara W, May 03 '22 at 15:51
