4

I'm trying to calculate the mean and standard deviation of several columns (except the first column) in a data.frame with NA values.

I've tried colMeans, sapply, etc., to loop over the data.frame and store the means and standard deviations in a separate table, but I keep getting a "FUN" error. Any help would be great. Thanks.


David C.
Anand Roopsind
  • 1
    Can you post the code for what you've tried? It's not clear where you're getting stuck. – Scott Ritchie Dec 27 '13 at 03:27
  • 1
    A "fun" error is not a helpful way to put it. What would help is the exact text of the error message; don't assume that nobody would understand it anyway. – lebatsnok Dec 27 '13 at 06:06

3 Answers

10
sapply(df, function(cl) list(means=mean(cl,na.rm=TRUE), sds=sd(cl,na.rm=TRUE)))
      col1     col2     col3     col4     col5    
means 3        8        12.5     18.25    22.5    
sds   1.581139 1.581139 1.290994 1.707825 1.290994

as.data.frame( t(sapply(df, function(cl) list(means=mean(cl,na.rm=TRUE), 
                                              sds=sd(cl,na.rm=TRUE))) ))
     means      sds
col1     3 1.581139
col2     8 1.581139
col3  12.5 1.290994
col4 18.25 1.707825
col5  22.5 1.290994
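
One thing to be aware of: because the function returns a list(), sapply builds a matrix of mode list, so the columns of the resulting data frame are lists rather than numeric vectors. If plain numeric columns are wanted, a variant along these lines might work. It is only a sketch: the data for df is assumed here (the question did not post any), and res is a hypothetical name.

```r
# Assumed example data; the original question did not include any
df <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c(6, 7, 8, 9, 10),
  col3 = c(11, 12, 13, 14, NA),
  col4 = c(16, NA, 18, 19, 20),
  col5 = c(21, 22, 23, 24, NA)
)

# vapply() with a numeric template returns a numeric matrix
# (one column per data frame column), not a list-matrix
res <- vapply(
  df,
  function(cl) c(means = mean(cl, na.rm = TRUE), sds = sd(cl, na.rm = TRUE)),
  FUN.VALUE = c(means = 0, sds = 0)
)

# Transpose so that each row is one original column
res <- as.data.frame(t(res))
print(res)
```

Use df[-1] instead of df in the call to skip the first column.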
IRTFM
  • It's good practice to replace the main argument with `...` to improve clarity: `sapply(df, function(...) list(means=mean(..., na.rm=TRUE), sds=sd(..., na.rm=TRUE)))` – smci Mar 27 '17 at 03:43
  • You think that is "more clear"? Mine looks more concrete, and I think would generate more informative error messages, and I thought "more clear", but perhaps I'm missing something deeper or even something more obvious? – IRTFM Mar 28 '17 at 02:19
  • Yes, the use of ellipsis for the main arg(s) which we are not modifying is more clear and it's a convention in R. As you can see from skimming it, it calls out very clearly that the point of our code is to add the non-default arg `na.rm=TRUE` to both fn calls. And more powerfully, the ellipsis can stand for multiple args. And the error message from yours aren't going to be any clearer. – smci Mar 28 '17 at 02:23
  • I am somewhat familiar with the R-dots (ellipsis) parsing formalism, but I missed that consensus document. I would read material if linked. – IRTFM Mar 28 '17 at 02:32
  • There's no authoritative citation I'm aware of, but it's pretty widespread in code; Hadley uses it a lot in his packages. Here's [one citation](https://ipub.com/r-three-dots-ellipsis/) – smci Mar 28 '17 at 02:36
  • Actually here's a somewhat authoritative citation [How to use R's ellipsis feature when writing your own function?](http://stackoverflow.com/questions/3057341/how-to-use-rs-ellipsis-feature-when-writing-your-own-function). Just to prove that this is convention. It doesn't mention the advantages I named. – smci Mar 28 '17 at 02:45
4

Almost all of the functions you should be using (e.g. colMeans) have a parameter called na.rm, which defaults to FALSE. Just do colMeans(x = your_df[-1], na.rm = TRUE) (the [-1] drops the first column) and you'll be good to go. The same applies to plain mean() if you want to go column by column.
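
As a rough illustration of the above, something like the following should work. The data frame your_df and its column names are made up for the example. Base R has no colSds() counterpart to colMeans(), so the standard deviation goes column by column through sapply.

```r
# Hypothetical example data; the first column is an ID to be skipped
your_df <- data.frame(
  id = 1:5,
  x  = c(6, 7, 8, 9, 10),
  y  = c(11, 12, 13, 14, NA),
  z  = c(16, NA, 18, 19, 20)
)

num_cols <- your_df[-1]  # drop the first column, keep a data.frame

# Column means, ignoring NA values
col_means <- colMeans(num_cols, na.rm = TRUE)

# Column standard deviations; na.rm = TRUE is passed through to sd()
col_sds <- sapply(num_cols, sd, na.rm = TRUE)

print(col_means)
print(col_sds)
```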

Adam Hyland
2

The following example code may prove useful.

# Create a 5 column dataframe that contains some NAs
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
col3 <- c(11,12,13,14,NA)
col4 <- c(16,NA,18,19,20)
col5 <- c(21,22,23,24,NA)
dataframe <- data.frame(col1,col2,col3,col4,col5)

# Apply the mean() function to all but the first column of the dataframe
apply(dataframe[,2:ncol(dataframe)], 2, function(x) mean(x, na.rm=TRUE))

# Check that the returned values are correct:
mean(col2)
mean(col3, na.rm=TRUE)
mean(col4, na.rm=TRUE)
mean(col5, na.rm=TRUE)

For the standard deviation, replace mean() with sd().
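
Since the question asks for the results in a separate table, here is one possible way to collect both statistics, sketched with the same example data. The names cols and stats are just illustrative.

```r
# Same example data as above
dataframe <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c(6, 7, 8, 9, 10),
  col3 = c(11, 12, 13, 14, NA),
  col4 = c(16, NA, 18, 19, 20),
  col5 = c(21, 22, 23, 24, NA)
)

cols <- dataframe[-1]  # every column except the first

# One row per column, with its mean and standard deviation (NAs ignored)
stats <- data.frame(
  mean = sapply(cols, mean, na.rm = TRUE),
  sd   = sapply(cols, sd, na.rm = TRUE)
)
print(stats)
```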

Graeme Walsh