3

I have in R the following data frame:

ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
             "E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
        "CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY

DATA = data.frame(ID,VAR,CATEGORY);DATA

That looks like this table below :

ID VAR CATEGORY
1 A ANE
1 A ANE
1 A ANA
1 A ANB
1 B ANE
2 C BOO
2 C BOA
2 D BOO
3 E CAT
3 E CAT
4 F DOG
4 A ANE
4 B ANE
4 F DOG
4 C FUT
4 F DOG

ideal output given the above data frame in R I want to be like that:

ID TEXTS category
1 A ANE
2 C BOO
3 E CAT
4 F DOG

More specifically: I want for ID say 1 to search the most common value in the column VAR which is A and then to search the most common value in the column CATEGORY related to the most common value A which is the ANE and so forth.

How can I do it in R ? Imagine that it is sample example.My real data frame contains 850.000 rows and has 14000 unique ID.

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
Homer Jay Simpson
  • 1,043
  • 6
  • 19

4 Answers4

6

Another dplyr strategy using count and slice:

library(dplyr)
DATA %>% 
    group_by(ID) %>% 
    count(VAR, CATEGORY) %>% 
    slice(which.max(n)) %>% 
    select(-n)
     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOA     
3     3 E     CAT     
4     4 F     DOG  
TarJae
  • 72,363
  • 6
  • 19
  • 66
3

dplyr

library(dplyr)
DATA %>%
  group_by(ID) %>%
  filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
  group_by(ID, VAR) %>%
  summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
  ungroup()
# # A tibble: 4 x 3
#      ID VAR   CATEGORY
#   <dbl> <chr> <chr>   
# 1     1 A     ANE     
# 2     2 C     BOA     
# 3     3 E     CAT     
# 4     4 F     DOG     

Data

DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
3

We could modify the Mode to return the index and use that in slice after grouping by 'ID'

Modeind <- function(x) {
   ux <- unique(x)
   which.max(tabulate(match(x, ux)))
   }
library(dplyr)
DATA %>%
     group_by(ID) %>%
     slice(Modeind(VAR)) %>%
     ungroup

-output

# A tibble: 4 x 3
     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOO     
3     3 E     CAT     
4     4 F     DOG     
akrun
  • 874,273
  • 37
  • 540
  • 662
  • .@akrun, yours is the only one, which reflects OPs ouptput. In ID 2 which one should be choosen `BOA` or `BOO`. Both are possible? – TarJae Sep 09 '21 at 20:17
  • 1
    @TarJae In the `Modeind` function it is jus selecting the first occurrence in case of ties. I don't know whether they need to sample in that case or not – akrun Sep 09 '21 at 20:19
  • 1
    @TarJae I noticed that. In those cases when there is not a clear order (other than as-presented), I tend to infer ties don't matter much. If the OP needs something else, they'll say it or it will come out in spiral-development. – r2evans Sep 09 '21 at 20:20
3

A base R option with nested subset + ave

subset(
  subset(
    DATA,
    !!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
  ),
  !!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)

gives

   ID VAR CATEGORY
1   1   A      ANE
6   2   C      BOO
9   3   E      CAT
11  4   F      DOG

Explanation

  1. The inner subset + ave is to filter out the rows with the most common VAR values (grouped by ID)
  2. Based on the trimmed data frame the previous step, the outer subset + ave is to filter out the rows with the most common CATEGORY values ( grouped by ID + VAR)
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81