
I have a data frame with factor columns. Here is a tiny example:

dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))

I can calculate the means of the numeric values that underlie the factor labels for each column:

mean(as.integer(dat$one))
[1] 1.5

But since there are very many columns in my data frame, I would like to avoid having to calculate all the individual means and would rather do something like:

colMeans(dat)

which doesn't work, since the columns are factors, or

colMeans(as.integer(dat))

which doesn't work either.

So how can I easily calculate the means of all factor columns, without a loop or individually calculating them all?

Do I really have to change the class of all columns?

  • `colMeans(data.matrix(dat))` could work. – David Arenburg Feb 03 '16 at 08:00
  • @Pascal Do `str(dat)`, which will return three lines, one of which reads: `$ one: Factor w/ 2 levels "a","b": 1 2`. This tells you that "a" and "b" are merely the labels and that the factor contains numbers. By casting the factor as numeric or integer (my second example), I can get at these numbers. –  Feb 03 '16 at 08:08
  • Thank you, @DavidArenburg, that is perfect. –  Feb 03 '16 at 08:09
  • You should beware of such operations though. Sometimes the underlying integers could be pretty messed up. – David Arenburg Feb 03 '16 at 08:22
  • Thanks for the reminder, @DavidArenburg, that's easy to forget. In the present case I took great care to sort my labels correctly and to think about whether the factors are actually more than ordinally scaled. –  Feb 03 '16 at 09:11
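
The caveat raised in the comments above can be illustrated with a small sketch. The `ratings` vector and its labels are hypothetical, made up only for this illustration. By default, `factor()` sorts levels alphabetically, so the integer codes need not follow the intended ordinal order unless `levels` is given explicitly:

```r
# Hypothetical ordinal ratings; names and values are made up for illustration.
ratings <- c("low", "high", "medium", "low")

# Default factor(): levels are sorted alphabetically (high, low, medium).
f_default <- factor(ratings)
codes_default <- as.integer(f_default)   # high = 1, low = 2, medium = 3

# Explicit levels: codes follow the intended order (low, medium, high).
f_ordered <- factor(ratings, levels = c("low", "medium", "high"))
codes_ordered <- as.integer(f_ordered)   # low = 1, medium = 2, high = 3

mean(codes_default)   # 2
mean(codes_ordered)   # 1.75
```

The same data give two different means, which is why it is worth checking `levels()` of each column before averaging the codes.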

4 Answers


`data.matrix` is pretty much designed for such a task. It converts factor columns to their integer codes and leaves numeric and integer columns as they are, which helps keep memory usage down, though the conversion to a matrix can sometimes add overhead. So as long as you don't have character columns, this should be pretty straightforward:

colMeans(data.matrix(dat))
# one two 
# 1.5 1.5
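
As a rough sketch of how this behaves when numeric columns are present too, consider the data frame below. `mixed` and its `num` column are hypothetical and not part of the question's data:

```r
# Hypothetical mixed data frame: two factor columns and one numeric column.
mixed <- data.frame(
  one = factor(c("a", "b")),
  two = factor(c("c", "d")),
  num = c(10.5, 20.5)
)

# Factors become their integer codes; the numeric column keeps its values.
m <- data.matrix(mixed)
col_means <- colMeans(m)
col_means
```

Here `col_means` should be a named numeric vector with one entry per column, so the factor means and the ordinary numeric mean come back together.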
David Arenburg

We can use lapply

lapply(dat, function(x) mean(as.integer(x)))

Or with dplyr

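library(dplyr)
# summarise_each() and funs() are deprecated in current dplyr;
# across() (dplyr >= 1.0.0) is the supported replacement.
dat %>%
  summarise(across(everything(), ~ mean(as.integer(.x))))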

For big datasets, it may be better to calculate the mean of each column separately, as converting the whole data frame to a matrix may create memory issues.
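
If a named numeric vector is more convenient than the list that `lapply` returns, one possible variant (a sketch, not the only way) is `vapply`, restricted to the factor columns. The helper names `is_fac` and `factor_means` are made up for this example:

```r
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))

# Flag the factor columns, then average their integer codes one column at a time.
is_fac <- vapply(dat, is.factor, logical(1))
factor_means <- vapply(dat[is_fac], function(x) mean(as.integer(x)), numeric(1))
factor_means
```

Because each column is processed on its own, no intermediate matrix of the whole data frame is built, and any non-factor columns are simply left out.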

akrun

Write a simple function that uses a for loop to write all of the values into a vector.

dat <- data.frame(one = c(1:10), two = c(1:10))

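# Note: this definition masks base R's colMeans() for the rest of the session.
colMeans <- function(tablename){
  # Pre-allocate one numeric slot per column
  colmean <- numeric(ncol(tablename))

  # Fill in the mean of each column in turn
  for(i in seq_len(ncol(tablename))){
    colmean[i] <- mean(tablename[,i])
  }
  return(colmean)
}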

colMeans(dat)

Hope this works

  • FYI, there's already a function called `colMeans`, so you might want to give your custom function a different name. – talat Feb 03 '16 at 08:28

You can also use the data.table package, which is faster than data.frame. If your data is big, e.g. millions of observations, then data.table can help reduce run time.

Below is the code:

library(data.table)
dat <- data.table(one = factor(c("a", "b")), two = factor(c("c", "d")))
factorCols <- c("one", "two")
dat[, lapply(.SD, FUN=function(x) mean(as.integer(x))), .SDcols=factorCols]
Kumar Manglam