10

I am trying to find the means, not including NAs, for multiple columns withing a dataframe by multiple groups

airquality <- data.frame(City = c("CityA", "CityA","CityA",
                                  "CityB","CityB","CityB",
                                  "CityC", "CityC"),
                         year = c("1990", "2000", "2010", "1990", 
                                  "2000", "2010", "2000", "2010"),
                         month = c("June", "July", "August",
                                   "June", "July", "August",
                                   "June", "August"),
                         PM10 = c(runif(3), rnorm(5)),
                         PM25 = c(runif(3), rnorm(5)),
                         Ozone = c(runif(3), rnorm(5)),
                         CO2 = c(runif(3), rnorm(5)))
airquality

So I get a list of the names with the number so I know which columns to select:

nam<-names(airquality)
namelist <- data.frame(matrix(t(nam)));namelist

I want to calculate the mean by City and Year for PM25, Ozone, and CO2. That means I need columns 1,2,4,6:7)

acast(datadf, year ~ city, mean, na.rm=TRUE)

But this is not really what I want because it includes the mean of something I do not need and it is not in a data frame format. I could convert it and then drop, but that seems like a very inefficient way to do it.

Is there a better way?

mskfisher
  • 3,291
  • 4
  • 35
  • 48
Jen
  • 203
  • 1
  • 2
  • 12
  • 1
    Perhaps `library(dplyr); airquality %>% group_by(City, year) %>% summarise_at(vars("PM25", "Ozone", "CO2"), mean)` – akrun Sep 20 '17 at 19:32

3 Answers3

11

We can use dplyr with summarise_at to get mean of the concerned columns after grouping by the column of interest

library(dplyr)
airquality %>%
   group_by(City, year) %>% 
   summarise_at(vars("PM25", "Ozone", "CO2"), mean)

Or using the devel version of dplyr (version - ‘0.8.99.9000’)

airquality %>%
     group_by(City, year) %>%
     summarise(across(PM25:CO2, mean))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    So I tested the suggested answers by adding 2 data points for city A in 2010 and two data points for City C in 2000. The – Jen Sep 21 '17 at 21:57
  • Has this changed? When I run these two solutions I get errors for both: first `Error in get(.x, .env, mode = "function") : first argument has length > 1` and for the second solution: `Error in dots[[.index]] : subscript out of bounds` – Dylan_Gomes Oct 13 '22 at 20:55
  • @Dylan_Gomes both solutions work for me (tested now). The second one is the more uptodate – akrun Oct 13 '22 at 20:57
  • @Dylan_Gomes have you used the data showed i n the question post or is it some other data – akrun Oct 13 '22 at 20:58
  • weird. It is exactly the data in the OP and the code you've provided. I can get `aggregate()` to do the same thing and it works, but for some reason these don't for me. – Dylan_Gomes Oct 13 '22 at 20:59
  • @Dylan_Gomes please check your `packageVersion("dplyr")` and your R version – akrun Oct 13 '22 at 20:59
  • `> packageVersion("dplyr") [1] ‘1.0.9’ > sessionInfo() R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)` – Dylan_Gomes Oct 13 '22 at 21:00
  • @Dylan_Gomes it should have worked. I used `1.0.10` and R 4.2.0 – akrun Oct 13 '22 at 21:01
  • @Dylan_Gomes Try on a fresh R session – akrun Oct 13 '22 at 21:01
  • restarting R worked, but not sure why. I tried using the dplyr namespace in front of the `summarise` functions even. Nonetheless, this worked. Thanks for that. – Dylan_Gomes Oct 13 '22 at 21:04
  • @akrun I didn't have plyr loaded and `dplyr::summarise` did *not* work, while restarting R did work - that was my point and why I am confused. – Dylan_Gomes Oct 13 '22 at 21:09
  • @Dylan_Gomes sorry, I read your comments incorrectly. There could be multiple reasons for that including some env pollution. Is it in Rstudio? – akrun Oct 13 '22 at 21:14
3

The summarise_at solution by Colin is simplest, but of course there are several. Here is another solution, using tidyr to rearrange and calculate the mean:

airquality %>%  
  select(City, year, PM25, Ozone, CO2) %>% 
  gather(var, value, -City, -year) %>%
  group_by(City, year, var) %>% 
  summarise(avg = mean(value, na.rm=T)) %>% # can stop here if you want
  spread(var, avg) # optional to make this into a wider table
# A tibble: 8 x 5
# Groups:   City, year [8]
    City   year          CO2       Ozone         PM25
* <fctr> <fctr>        <dbl>       <dbl>        <dbl>
1  CityA   1990  0.275981522  0.19941717  0.826008441
2  CityA   2000  0.090342153  0.50949094  0.005052771
3  CityA   2010  0.007345704  0.21893117  0.625373926
4  CityB   1990  1.148717447 -1.05983482 -0.961916973
5  CityB   2000 -2.334429324  0.28301220 -0.828515418
6  CityB   2010  1.110398814 -0.56434523 -0.804353609
7  CityC   2000 -0.676236740  0.20661529 -0.696816058
8  CityC   2010  0.229428142  0.06202997 -1.396357288
Matt L.
  • 2,753
  • 13
  • 22
2

You should try dplyr::mutate_at :

library(dplyr)
airquality %>%
  group_by(City, year) %>%
  summarise_at(.vars = c("PM10", "PM25", "Ozone", "CO2"), .funs = mean)

# A tibble: 8 x 6
# Groups:   City [?]
    City   year         PM10       PM25      Ozone         CO2
  <fctr> <fctr>        <dbl>      <dbl>      <dbl>       <dbl>
1  CityA   1990  0.004087379  0.5146409 0.44393422  0.61196671
2  CityA   2000  0.039414194  0.8865582 0.06754322  0.69870187
3  CityA   2010  0.116901563  0.6608619 0.51499227  0.32952099
4  CityB   1990 -1.535888778 -0.9601897 1.17183649  0.08380664
5  CityB   2000  0.226046487  0.4037230 0.86554997 -0.05698204
6  CityB   2010 -0.824719956  0.1508471 0.32089806 -0.12871853
7  CityC   2000 -0.824509111 -0.6928741 0.85553837  0.12137923
8  CityC   2010 -1.626150294  1.5176198 0.21183149 -0.63859910
Colin FAY
  • 4,849
  • 1
  • 12
  • 29