
With a dataset like original:

id <- c("JF", "GH", "GH", "ANN", "GH", "ROG", "JF")
group <- c("most", "least", "most", "least", "least", "most", "least")
NP <- c(4,6,18,1,3,12,8)
iso_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_CHN <- c(0, 1, 1, 0, 0, 0, 0)
color <- c("blue", "orange", "blue", "blue", "red", "orange", "black")

original <- data.frame(id, group, NP, iso_USA, iso_CHN, color)


numeric <- unlist(lapply(original, is.numeric))  
numeric <- names(original[ , numeric])

char <- unlist(lapply(original, is.character))  
char <- names(original[ , char])
char <- char[-1]   #remove id from variables of interest

I want to group by "group" and calculate the median of the numeric variables and the mode of the character variables, so that the result looks like original2 below. Note that my actual dataset has way more columns than the mock version presented here:

group <- c("least", "most")
NP <- c(6,12)
iso_USA <- c(0,1)
iso_CHN <- c(0, 0)
color <- c("orange", "blue")

original2 <- data.frame(group, NP, iso_USA, iso_CHN, color)

Any clue?

vog

2 Answers


Using dplyr's across functionality and the accepted answer at the FAQ about implementing a mode function:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
original %>%
  select(-id) %>%
  group_by(group) %>%
  summarize(
    across(where(is.numeric), median),
    across(where(is.character), Mode)
  )
# # A tibble: 2 × 5
#   group    NP iso_USA iso_CHN color 
#   <chr> <dbl>   <dbl>   <dbl> <chr> 
# 1 least   4.5       0       0 orange
# 2 most   12         1       0 blue  
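
A quick aside (my note, not part of the original answer): tabulate(match(x, ux)) counts how often each unique value occurs, and which.max breaks ties in favour of the value that appears first. That is easy to check on the example data:

Mode(original$color)       # "blue"  (3 occurrences)
Mode(c("most", "least"))   # "most"  (tie, so the first unique value wins)
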
Gregor Thomas

Here's a base R solution.

Little EDIT (thx @Gregor Thomas for your valuable input)

For this I also need

  • a function that can tell me whether a column is numeric (plain is.numeric does the job) and - obviously -

  • a Mode function (which I will copy blatantly from @Gregor Thomas's solution).
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Usage

aggregate(original[, 3:6], by = list(original$group),
          FUN = function(x) if (is.numeric(x)) median(x) else Mode(x))

  Group.1   NP iso_USA iso_CHN  color
1   least  4.5       0       0 orange
2    most 12.0       1       0   blue
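
A small variant (my sketch, not part of the original answer): selecting the columns by name instead of by position 3:6 keeps this working when the real dataset has many more columns, and naming the by list replaces the Group.1 header:

aggregate(original[, setdiff(names(original), c("id", "group"))],
          by = list(group = original$group),
          FUN = function(x) if (is.numeric(x)) median(x) else Mode(x))
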
Andre Wildberg
  • Thank you for your intellectual curiosity and proposal! – vog Dec 02 '21 at 15:38
  • @GregorThomas All true and I agree. I also played around with finding a more performant solution. It's actually approx 3 times slower than your dplyr solution. The regex did boil down to avoiding the warnings `as.numeric` returns. In a different scenario I'd actually choose speed over "beauty". I included the cast because in this case aggregate treats the number columns as character. – Andre Wildberg Dec 02 '21 at 16:22
  • @GregorThomas Godammit, I hate when I am so deep in a rabbit hole to not see the obvious... :) well thanks! In that case I might switch to using the `'data.frame'` version of aggregate. – Andre Wildberg Dec 02 '21 at 16:42
  • @GregorThomas Now 6 times faster than dplyr, but of course that's a reasonable price to pay for readability, as you pointed out. – Andre Wildberg Dec 02 '21 at 17:25
  • @GregorThomas Interesting, good to know... I actually did it with a 100k rows, 100 cols version of the example data. Thx!! – Andre Wildberg Dec 02 '21 at 17:54
  • The number of groups makes a big difference for scalability - that's where `dplyr` beats `base` (and where `data.table` really shines). – Gregor Thomas Dec 02 '21 at 17:58
  • @GregorThomas Yep, `data.table` is a beast. Thanks for your insights!! – Andre Wildberg Dec 02 '21 at 18:05
  • Fun discussion :) I am going to delete most of my comments here, I don't think they need to be preserved on SO. – Gregor Thomas Dec 02 '21 at 18:29
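
Picking up the data.table mention in the comments above, here is a rough sketch of the equivalent group-wise aggregation (my addition, not from the thread; it reuses the Mode() helper defined in the answers):

library(data.table)

DT <- as.data.table(original)
# median for numeric columns, Mode for character columns, computed per group
DT[, lapply(.SD, function(x) if (is.numeric(x)) median(x) else Mode(x)),
   by = group,
   .SDcols = setdiff(names(DT), c("id", "group"))]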