1

I need to summarise some data.frame containing both numeric and character values. I need to sum the numerical features and to keep the most frequent character value per group at the same time. Ideally, it should be with dplyr but that's not mandatory.

Here is a reproducible example :

have <- data.frame("id" = c(1,1,1,2,2,2),"a" = c("dog","dog","cat","duck","duck","horse"),"value" = c(1,2,3,4,5,6))
need <- data.frame("id" = c(1,2), "a" = c("dog","duck"), "value" = c(6,15))

Any idea ?

Levon Ipdjian
  • 786
  • 7
  • 14
  • 1
    While @Quiten's answer is totally valid and probably the way to go here, you might wanna check this question: https://stackoverflow.com/q/2547402/13460602 where they discuss how to find the statistical mode (i.e. the most frequent value). – Maël Mar 10 '23 at 09:54

3 Answers3

4

Another option is to use collapse::fmode which is probably the fastest way of finding the statistical mode:

library(dplyr) 
have %>%
  group_by(id) %>%
  summarise(a = collapse::fmode(a),
            value = sum(value)) %>%
  ungroup()

#> # A tibble: 2 × 3
#>      id a     value
#>   <dbl> <chr> <dbl>
#> 1     1 dog       6
#> 2     2 duck     15
Maël
  • 45,206
  • 3
  • 29
  • 67
3

You could group_by and sum for total value and use table with which.max and names to select the character values that frequent the most like this:

library(dplyr)
have %>%
  group_by(id) %>%
  summarise(a = names(which.max(table(a))),
            value = sum(value))
#> # A tibble: 2 × 3
#>      id a     value
#>   <dbl> <chr> <dbl>
#> 1     1 dog       6
#> 2     2 duck     15

Created on 2023-03-10 with reprex v2.0.2

Quinten
  • 35,235
  • 5
  • 20
  • 53
3

Here we first count and then apply the max a to all other and finally summarise:

library(dplyr) # dplyr >= 1.1.0

have %>% 
  add_count(id, a) %>% 
  mutate(a = first(a), .by=id) %>% 
  summarise(value = sum(value), .by=c(id, a))
 
  id    a value
1  1  dog     6
2  2 duck    15
TarJae
  • 72,363
  • 6
  • 19
  • 66