
With a dataset like original:

id <- c("JF", "GH", "GH", "ANN", "GH", "ROG", "JF")
group <- c("most", "least", "most", "least", "least", "most", "least")
NP <- c(4,6,18,1,3,12,8)
iso_USA <- c(1, 0, 0, 0, 0, 1, 1)
iso_CHN <- c(0, 1, 1, 0, 0, 0, 0)
color <- c("blue", "orange", "blue", "blue", "red", "orange", "black")

original <- data.frame(id, group, NP, iso_USA, iso_CHN, color)


numeric <- unlist(lapply(original, is.numeric))  
numeric <- names(original[ , numeric])

char <- unlist(lapply(original, is.character))  
char <- names(original[ , char])
char <- char[-1]   #remove id from variables of interest

I want to group by "group" and calculate the median of the numeric variables and the mode of the character variables, so that the result looks like original2 below. Note that my actual dataset has way more columns than the mock version presented here:

group <- c("least", "most")
NP <- c(6,12)
iso_USA <- c(0,1)
iso_CHN <- c(0, 0)
color <- c("orange", "blue")

original2 <- data.frame(group, NP, iso_USA, iso_CHN, color)

Any clue?

vog

2 Answers


Using dplyr's across functionality and the accepted answer at the FAQ about implementing a mode function:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

library(dplyr)
original %>%
  select(-id) %>%
  group_by(group) %>%
  summarize(
    across(where(is.numeric), median),
    across(where(is.character), Mode)
  )
# # A tibble: 2 × 5
#   group    NP iso_USA iso_CHN color 
#   <chr> <dbl>   <dbl>   <dbl> <chr> 
# 1 least   4.5       0       0 orange
# 2 most   12         1       0 blue  
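
A quick aside (my note, not part of the original answer): tabulate(match(x, ux)) counts how often each unique value occurs, and which.max breaks ties in favour of the value that appears first. That is easy to check on the example data:

Mode(original$color)       # "blue"  (3 occurrences)
Mode(c("most", "least"))   # "most"  (tie, so the first unique value wins)
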
Gregor Thomas

Here's a base R solution.

Little EDIT (thx @Gregor Thomas for your valuable input)

For this I also need

  • a function that can tell me whether a column is numeric (plain is.numeric does the job) and - obviously -

  • a Mode function (which I will copy blatantly from @Gregor Thomas's solution).
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Usage

aggregate(original[, 3:6], by = list(original$group),
          FUN = function(x) if (is.numeric(x)) median(x) else Mode(x))

  Group.1   NP iso_USA iso_CHN  color
1   least  4.5       0       0 orange
2    most 12.0       1       0   blue
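
A small variant (my sketch, not part of the original answer): selecting the columns by name instead of by position 3:6 keeps this working when the real dataset has many more columns, and naming the by list replaces the Group.1 header:

aggregate(original[, setdiff(names(original), c("id", "group"))],
          by = list(group = original$group),
          FUN = function(x) if (is.numeric(x)) median(x) else Mode(x))
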
Andre Wildberg
  • Thank you for your intellectual curiosity and proposal! – vog Dec 02 '21 at 15:38
  • @GregorThomas All true and I agree. I also played around with finding a more performant solution. It's actually approx 3 times slower than your dplyr solution. The regex did boil down to avoiding the warnings `as.numeric` returns. In a different scenario I'd actually choose speed over "beauty". I included the cast because in this case aggregate treats the number columns as character. – Andre Wildberg Dec 02 '21 at 16:22
  • @GregorThomas Godammit, I hate when I am so deep in a rabbit hole to not see the obvious... :) well thanks! In that case I might switch to using the `'data.frame'` version of aggregate. – Andre Wildberg Dec 02 '21 at 16:42
  • @GregorThomas Now 6 times faster than dplyr, but of course that's a reasonable price to pay for readability, as you pointed out. – Andre Wildberg Dec 02 '21 at 17:25
  • @GregorThomas Interesting, good to know... I actually did it with a 100k rows, 100 cols version of the example data. Thx!! – Andre Wildberg Dec 02 '21 at 17:54
  • The number of groups makes a big difference for scalability - that's where `dplyr` beats `base` (and where `data.table` really shines). – Gregor Thomas Dec 02 '21 at 17:58
  • @GregorThomas Yep, `data.table` is a beast. Thanks for your insights!! – Andre Wildberg Dec 02 '21 at 18:05
  • Fun discussion :) I am going to delete most of my comments here, I don't think they need to be preserved on SO. – Gregor Thomas Dec 02 '21 at 18:29
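
Picking up the data.table mention in the comments above, here is a rough sketch of the equivalent group-wise aggregation (my addition, not from the thread; it reuses the Mode() helper defined in the answers):

library(data.table)

DT <- as.data.table(original)
# median for numeric columns, Mode for character columns, computed per group
DT[, lapply(.SD, function(x) if (is.numeric(x)) median(x) else Mode(x)),
   by = group,
   .SDcols = setdiff(names(DT), c("id", "group"))]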