How to choose the most common value in a group related to other group in R?

Question

I have in R the following data frame:

ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
             "E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
        "CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY

DATA = data.frame(ID,VAR,CATEGORY);DATA

That looks like this table below :

ID	VAR	CATEGORY
1	A	ANE
1	A	ANE
1	A	ANA
1	A	ANB
1	B	ANE
2	C	BOO
2	C	BOA
2	D	BOO
3	E	CAT
3	E	CAT
4	F	DOG
4	A	ANE
4	B	ANE
4	F	DOG
4	C	FUT
4	F	DOG

ideal output given the above data frame in R I want to be like that:

ID	TEXTS	category
1	A	ANE
2	C	BOO
3	E	CAT
4	F	DOG

More specifically: I want for ID say 1 to search the most common value in the column VAR which is A and then to search the most common value in the column CATEGORY related to the most common value A which is the ANE and so forth.

How can I do it in R ? Imagine that it is sample example.My real data frame contains 850.000 rows and has 14000 unique ID.

score 6 · Accepted Answer · answered Sep 09 '21 at 20:07

Another dplyr strategy using count and slice:

library(dplyr)
DATA %>% 
    group_by(ID) %>% 
    count(VAR, CATEGORY) %>% 
    slice(which.max(n)) %>% 
    select(-n)

     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOA     
3     3 E     CAT     
4     4 F     DOG

score 3 · Answer 2 · answered Sep 09 '21 at 19:50

dplyr

library(dplyr)
DATA %>%
  group_by(ID) %>%
  filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
  group_by(ID, VAR) %>%
  summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
  ungroup()
# # A tibble: 4 x 3
#      ID VAR   CATEGORY
#   <dbl> <chr> <chr>   
# 1     1 A     ANE     
# 2     2 C     BOA     
# 3     3 E     CAT     
# 4     4 F     DOG

Data

DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))

score 3 · Answer 3 · answered Sep 09 '21 at 19:58

3

We could modify the Mode to return the index and use that in slice after grouping by 'ID'

Modeind <- function(x) {
   ux <- unique(x)
   which.max(tabulate(match(x, ux)))
   }
library(dplyr)
DATA %>%
     group_by(ID) %>%
     slice(Modeind(VAR)) %>%
     ungroup

-output

# A tibble: 4 x 3
     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOO     
3     3 E     CAT     
4     4 F     DOG

answered Sep 09 '21 at 19:58

akrun

874,273
37
540
662

.@akrun, yours is the only one, which reflects OPs ouptput. In ID 2 which one should be choosen `BOA` or `BOO`. Both are possible? – TarJae Sep 09 '21 at 20:17
1

@TarJae In the `Modeind` function it is jus selecting the first occurrence in case of ties. I don't know whether they need to sample in that case or not – akrun Sep 09 '21 at 20:19
1

@TarJae I noticed that. In those cases when there is not a clear order (other than as-presented), I tend to infer ties don't matter much. If the OP needs something else, they'll say it or it will come out in spiral-development. – r2evans Sep 09 '21 at 20:20

ThomasIsCoding · Answer 4 · 2021-09-09T22:23:25.773

3

A base R option with nested subset + ave

subset(
  subset(
    DATA,
    !!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
  ),
  !!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)

gives

   ID VAR CATEGORY
1   1   A      ANE
6   2   C      BOO
9   3   E      CAT
11  4   F      DOG

Explanation

The inner subset + ave is to filter out the rows with the most common VAR values (grouped by ID)
Based on the trimmed data frame the previous step, the outer subset + ave is to filter out the rows with the most common CATEGORY values ( grouped by ID + VAR)

edited Sep 09 '21 at 22:23

answered Sep 09 '21 at 21:43

ThomasIsCoding

96,636
9
24
81

Could you please explain with your words the logic. Is just for me to better memorize. Thank you. – TarJae Sep 09 '21 at 21:51
1

@TarJae Yes, now the explanations are added. – ThomasIsCoding Sep 09 '21 at 22:23

ID	VAR	CATEGORY
1	A	ANE
1	A	ANE
1	A	ANA
1	A	ANB
1	B	ANE
2	C	BOO
2	C	BOA
2	D	BOO
3	E	CAT
3	E	CAT
4	F	DOG
4	A	ANE
4	B	ANE
4	F	DOG
4	C	FUT
4	F	DOG

ID	VAR	CATEGORY
1	A	ANE
1	A	ANE
1	A	ANA
1	A	ANB
1	B	ANE
2	C	BOO
2	C	BOA
2	D	BOO
3	E	CAT
3	E	CAT
4	F	DOG
4	A	ANE
4	B	ANE
4	F	DOG
4	C	FUT
4	F	DOG

How to choose the most common value in a group related to other group in R?

4 Answers4

dplyr

Explanation

ID	VAR	CATEGORY
1	A	ANE
1	A	ANE
1	A	ANA
1	A	ANB
1	B	ANE
2	C	BOO
2	C	BOA
2	D	BOO
3	E	CAT
3	E	CAT
4	F	DOG
4	A	ANE
4	B	ANE
4	F	DOG
4	C	FUT
4	F	DOG