1

There is a data frame x with 5753 observations of 4 variables.

The column names are: date, Depth, var1, and var2. I converted date and Depth to factor before performing aggregate().

I wanted to calculate average and standard deviation to 2 variables with grouping by date and Depth.

When applying aggregate(x[,3:4], by = list(x$date, x$Depth), FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got average of var1 and average of var 2 grouping by date and Depth, but I did not get SD.

When applying aggregate(. ~ date+Depth, data = x, FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got an error message: "Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate".

After counting NA in two column, I found out that there are 5622 NA in var1, 5049 NA in var2. I donot want to remove NA before applying aggregate() yet.

My questions are:

  1. why I did not get sd by applying the first syntax?

  2. why is the second syntax not workable? I learned this syntax from stackoverflow, and it worked with the following data frame,

    x3 <- read.table(text = "  id1 id2 val1 val2
    1   a   x    1    9
    2   a   x    2    4
    3   a   y    3    NA
    4   a   y    4    NA
    5   b   x    1    NA
    6   b   y    4    NA
    7   b   x    3    9
    8   b   y    2    8", header = TRUE)
    
r2evans
  • 141,215
  • 6
  • 77
  • 149
pinawa
  • 129
  • 6

1 Answers1

2

We can use dplyr, where we pass the grouping columns in group_by and the columns to summarise in summarise with across

library(dplyr) #1.0.0
x3 %>%
   group_by(id1, id2) %>% 
   summarise(across(starts_with('val'),  
        list(mean = ~ mean(., na.rm = TRUE) , sd = ~sd(., na.rm = TRUE))))
# A tibble: 4 x 6
# Groups:   id1 [2]
#  id1   id2   val1_mean val1_sd val2_mean val2_sd
#  <chr> <chr>     <dbl>   <dbl>     <dbl>   <dbl>
#1 a     x           1.5   0.707       6.5    3.54
#2 a     y           3.5   0.707     NaN     NA   
#3 b     x           2     1.41        9     NA   
#4 b     y           3     1.41        8     NA   

If the version of dplyr is < 1.0.0, we can use summarise_at

x3 %>%
   group_by(id1, id2) %>%  
   summarise_at(vars(-group_cols()), list(mean = ~ mean(., na.rm = TRUE), 
              sd = ~ sd(., na.rm = TRUE)))

With aggregate, the error we get because of the NA elements and it uses by default na.action = na.drop removing the row if there is any NA in that row. Either specify na.action = na.pass or NULL and this would resolve that issue. But, having multiple functions to be applied with c, it will result in a matrix column. Inorder to have normal data.frame, columns, we can wrap with data.frame in do.call

do.call(data.frame, aggregate(. ~ id1 + id2, data = x3, FUN = function(x) 
   c(avg = mean(x, na.rm = TRUE), SD= sd(x, na.rm = TRUE)), na.action = NULL))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi akrun, in do.call(), what is data.frame? Is it the data.frame I want to create? – pinawa Jul 04 '20 at 21:19
  • I also applied your dplyr approach to my data.frame. I still did not get sd columns. – pinawa Jul 04 '20 at 21:21
  • @lennylin. The output from `aggregate` is a `data.frame`, but some columns are matrices because we have multiple functions on the same columns If you check the `str(aggregate(. ~ id1 + id2, data = x3, FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd(x, na.rm = TRUE)), na.action = NULL))` you will find the issue. So, I wrapped with `data.frame` in `do.call` to make those matrix as indepdent columns – akrun Jul 04 '20 at 21:21
  • @lennylin i used `dplyr 1.0.0`. if your version is below that, you can use `summarise_at`, updated the post – akrun Jul 04 '20 at 21:23
  • @ akrun, I am using `dplyr 1.0.0` with `R 4.0.2`. I also tried summarise_at, I got an error message: `across()` must only be used inside dplyr verbs. – pinawa Jul 04 '20 at 21:29
  • @lennylin is it the same data you showed in the post ('x3'). I am using R 4.0 with dplyr 1.0.0. Both the codes are working for me though – akrun Jul 04 '20 at 21:30
  • @akun, no, I am working with a data frame x with 5753 observations of 4 variables. I found solutions work well with x3, but not mine. – pinawa Jul 04 '20 at 21:33
  • @lennylin then, it must be either an issue with type of columns i.e. not numeric columns used in summarise or NA only cases? Without a small reproducible example with `dput`, it is not clear – akrun Jul 04 '20 at 21:34
  • @ akrun, I am new to `dput`. anyway, here is a structure: `> str(x3) 'data.frame': 5753 obs. of 4 variables: $ date : Date, format: "2016-09-26" "2016-12-12" "2016-09-13" "2016-05-23" ... $ Depth : chr "01~05m" "01~05m" "01~05m" "01~10m" ... $ Chlorophyll_a : num NA NA NA NA NA NA NA NA NA NA ... $ Dissolved_Oxygen_Field: num NA NA NA NA NA NA NA NA NA NA ... > ` – pinawa Jul 04 '20 at 21:41
  • @lennylin if you do `dput(droplevels(head(iris, 20)))`, it will print the structure of the first 20 rows. Similarly in your case it would be `dput(droplevels(head(x3, 20)))`. From the `str`, I see a lot of NA elements. i.e. for a particular group, there may be only NAs – akrun Jul 04 '20 at 21:43
  • @lennylin i also tried with changing some elements to NA ,so that it is all NA, but it is still working with the data. `x3$val2[1:2] <- NA` – akrun Jul 04 '20 at 21:44
  • yes, there are 5622 NA in var1, 5049 NA in var2 out of 5753 obs. – pinawa Jul 04 '20 at 21:45
  • @lennylin, but the code is working even when groups are having all NAs – akrun Jul 04 '20 at 21:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217231/discussion-between-lennylin-and-akrun). – pinawa Jul 04 '20 at 21:47
  • 1
    In conclusion, I was advised not using aggregate() to do with my data frame because there are so many NA, but use `group_by()` and `summarise()` approach. – pinawa Jul 04 '20 at 22:34