
First, this is a "duplicate" of Aggregating mixed data by factor column

I am raising the question again because the answer there does not work when the dataset has multiple id variables. I want to aggregate a dataset so that a factor variable is summarised by its most frequent level and a numeric variable by its mean. For example (drawing on the answer by David_B):

set.seed(1)
df <- data.frame(factor = as.factor(sample(1:3, 1000, TRUE)), not.factor = rnorm(1000),
                 id1 = as.factor(rep(1:10, 100)), id2 = as.factor(rep(1:10, each = 100)))

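# Most frequent level of a factor
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Mean for numeric columns, mode for factor columns, computed per group in `id`
ag <- function(x, id, ...) {
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
}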

The following then works with a single id variable:

df2 <- data.frame(lapply(df, ag, id = df$id2))

But it does not work when I have multiple id variables:

df2 <- data.frame(lapply(df, ag, id = cbind(df$id1,df$id2)))

It fails with the following error:

Error in tapply(x, id, getmode) : arguments must have same length
Leonhardt Guass
  • Don't use `cbind`, use `paste` or something like that. – Gregor Thomas Jul 05 '18 at 16:38
  • @Gregor I know paste will work, but it also produces a final dataset with a single id variable. I can fix that with extra steps, but I feel that is not as "clean" a solution as I want. In addition, I am curious why this error happens. – Leonhardt Guass Jul 05 '18 at 16:42
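
On why the error happens, here is a minimal sketch, assuming `tapply` treats any non-list `INDEX` as a single grouping vector (which is how its documentation describes it). `cbind` on two factors returns an integer matrix, so the index ends up with twice as many elements as `x`. The variables below are rebuilt from the question's data so the block runs on its own.

```r
set.seed(1)
x   <- rnorm(1000)
id1 <- as.factor(rep(1:10, 100))
id2 <- as.factor(rep(1:10, each = 100))

# cbind() drops the factor class and returns a 1000 x 2 integer matrix.
idx <- cbind(id1, id2)
dim(idx)                # 1000 2

# Treated as ONE grouping vector, the matrix has 2000 elements,
# which does not match length(x) == 1000, hence the error.
length(as.factor(idx))  # 2000

# A list of factors gives one grouping factor per id variable.
res <- tapply(x, list(id1, id2), mean)
dim(res)                # 10 10
```

Note that the list form returns a 10 x 10 matrix rather than one row per id combination, so it cannot simply be dropped into `data.frame(lapply(...))`; a grouped summary is likely the easier route.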

1 Answer


We can group by 'id1' and 'id2' and use an if/else condition to return the Mode for factor columns and the mean otherwise:

library(dplyr)
# note: funs() is deprecated as of dplyr 0.8.0, so this may emit a warning
df %>%
  group_by(id1, id2) %>%
  summarise_all(funs(if(is.factor(.)) Mode(.) else mean(.)))

where

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
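
Since `funs()` has been deprecated since dplyr 0.8.0, the same idea can be sketched with `across()`, assuming dplyr 1.0.0 or later. This is only an alternative spelling of the grouped summary above, not a different method; the data and `Mode` are repeated so the block runs on its own, and `res` is just an illustrative name.

```r
library(dplyr)

# Mode() as defined above, repeated so the block is self-contained.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Same example data as in the question.
set.seed(1)
df <- data.frame(
  factor     = as.factor(sample(1:3, 1000, TRUE)),
  not.factor = rnorm(1000),
  id1        = as.factor(rep(1:10, 100)),
  id2        = as.factor(rep(1:10, each = 100))
)

# across(everything()) skips the grouping columns, so only
# `factor` and `not.factor` are summarised.
res <- df %>%
  group_by(id1, id2) %>%
  summarise(
    across(everything(), ~ if (is.factor(.x)) Mode(.x) else mean(.x)),
    .groups = "drop"
  )
```

The result has one row per (id1, id2) combination, with both id variables kept as separate columns.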
akrun