Aggregating mixed data by factor column

Question

For the past week I have been trying to aggregate my dataset that consists of different weight measurements in different months accompanied by a large volume of background variables in R.

I have read many different asked questions on this topic (i.e. R aggregate data by defining grouping, How to aggregate count of unique values of categorical variables in R), but they all seem to either only work with one type of data or are only interested in one column. Specifically, question Recoding categorical variables to the most common value deals with almost exactly the same problem, but the proposed answer only fixes the problem for the categorical data, it does not include the numeric data as well. My data consist of both factors(categorical and ordinal) and numeric data.

The reproducible example is:

IDnumber <- c("1", "1", "1", "2", "2", "3", "3", "3")
Gender <- c("Male", "Male", "Male", "Female", "Female", "Female", "Female",  "Female")
Weight <- c(80, 82, 82, 70, 66, 54, 50, 52)
LikesSoda <- c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", NA)
df = data.frame(IDnumber, Gender, Weight, LikesSoda)

My output dataframe would take the mean of each numerical column, and the most frequent factor for each factor column. In the example this would look as following:

IDnumber <- c("1", "2", "3")
Gender <- c("Male", "Female", "Female")
Weight <- c(81.5, 78, 52)
LikesSoda <- c("No", "Yes", "Yes")
output = data.frame(IDnumber, Gender, Weight, LikesSoda)

So far I've tried to split the dataframe into a factor dataframe and numeric dataframe and use two aggregates with a different function (mean for the numeric, but I've not been able to find a working function for the categorical data). The other option is to use a dplyr df &>& group_by(IDnumber) %>% summarise( transformation for each variable ) code, but that requires me to specify how to handle each column manually. Since I have over 2500 columns, this does not seem like a workable solution.

David_B · Accepted Answer · 2016-05-29T16:08:44.700

You could write your own functions and then use lapply. First, write a function to find the most frequent level in a factor variable

getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

Then write a function to return either the mean or mode depending on the type of variable passed to it

my_summary <- function(x, id, ...){
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }  
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }  
}

Finally, use lapply to calculate the summaries

data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes

If there might be two or more levels in a factor with the same, maximum frequency then which.max will just return the first one. I understand from your comment that you just want to know how many of them there are, so one option might be to amend the getmode function slightly, so it adds an asterisk to the level when there is a tie:

getmode <- function(v) {
  tab <- table(v)
  if (sum(tab %in% max(tab)) > 1)  return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

(Changing your sample data so there is one Female and one Male with IDnumber == "2")

data.frame(lapply(df, my_summary, id = df$IDnumber))

  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes

I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.

Thank you, this seems to work well. As a novice R user I mostly use the built in functions, but this seems like a good opportunity to practice with custom functions. Should I use na.rm=TRUE in the mean function in case of missings in the numeric columns? — GIREFA, May 29 '16 at 13:54
You're welcome! Yes, you could use na.rm = TRUE, but you'll first have to edit the my_summary function slightly: `my_summary <- function(x, id, na.rm = FALSE){ if (is.numeric(x)) { return(tapply(x, id, mean, na.rm = na.rm)) } if (is.factor(x)) { return(tapply(x, id, getmode)) } }` Then you can use `data.frame(lapply(df, my_summary, id = df$IDnumber, na.rm = TRUE))`. — David_B, May 29 '16 at 14:01
One final question. The documentation on which.max says that the first maximum is returned. In case of my complete data, there may be cases where on a yes/no question, a participant has answered Yes and No once, and had 1 missing. In this case the function will choose No since it is alphabetically "lower". I'm trying to find a way to gain insight into in how many cases this is happening, but having trouble since which.max only returns 1 value and max returns NULL. I know that when I want to keep the columns as string factors there is no workaround, but want to know whether it has an impact. — GIREFA, May 29 '16 at 15:20

Aggregating mixed data by factor column

1 Answers1

Linked