0

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.

Here is what I am trying:

head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))

But I get the error that my lenghts don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seem to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).

Also, I want to use select because I'm trying to do this over just certain variables.

I'm sure this could be a duplicate, but I can't find the right one.

EDIT FOR CLARITY Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:

Month v1 v2 v3 Sum 
1      1  1  0   2
1      1  1  1   3
1      1  0  0   3

Then the result would be:

Month  Average
1           8/3
vashts85
  • 1,069
  • 3
  • 14
  • 28
  • Your description, "group a dataframe by one variable and figure the mean of several variables within that group," sounds like you want `airquality %>% group_by(Month) %>% select(target_vars) %>% summarise_all(mean, na.rm = T)`. It doesn't sound like `rowSums` are needed at all. – Gregor Thomas Oct 23 '17 at 14:49
  • 1
    Why would you need to `group_by` if you're calculating `rowSums`? – Abdou Oct 23 '17 at 14:49
  • @vashts85 when you calculate `Average` are you dividing by the number of columns? Or the number of rows? It seems some of the answers are dividing by the number of columns. – CPak Oct 26 '17 at 01:00
  • I suppose it's the number of rows. So month 1 has 3 rows, and the sum of each row is 8, so you divided 8 by 3. – vashts85 Oct 26 '17 at 01:58
  • @vashts85 it looks Jimbou is dividing by number of columns (perhaps Jimbou can add confirmation here). I recommend calculating the mean of rowSums for the 5th month to see which answer gives you the expected answer. The answers all differ so you'll have to decide which one provides the solution you're looking for. – CPak Oct 26 '17 at 02:10

3 Answers3

2

This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):

sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
    5     6     7     8     9 
 7541  8343 10849  8974  8242 

If you wanted the per number of variable results, then you would divide by the number of variables:

sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
                                                           (length(target_vars))
       5        6        7        8        9 
2513.667 2781.000 3616.333 2991.333 2747.333 
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Plase comment on the output of `sum(airquality[,-5],na.rm = T)` and `sum(sapply( split( airquality, airquality$Month), sum, na.rm=TRUE))`. i think its the question if you add the Month variable to the result or not...In my solution I don't. In contrast your result includes the sum of Month. – Roman Oct 23 '17 at 15:50
  • You are correct, but our methods still don't agree. It's interesting that you can see the error in mine bur yours is rather opaque, so I'm unable to debug you methods, which I still think are wrong ... somewhere. – IRTFM Oct 23 '17 at 16:13
  • Hm why you think my solution is opaque? – Roman Oct 24 '17 at 07:37
  • Because I think it's wrong (at least it gets a different final result than an R action which I think is rather clear), and I cannot follow the logic to figure out where the error is. I think one would need to look at each intermediate result and determine whether it is actually what is intended. So that to me is "opaque". – IRTFM Oct 25 '17 at 16:48
  • @Jimbou Take a look at `sum( aq[ aq$Month==5, !grepl("Month", colnames(aq) ) ], na.rm=TRUE) # [1] 8397.3` and at `sum( aq[ aq$Month==5, !grepl("Month", colnames(aq) ) ], na.rm=TRUE)/5 #[1] 1679.46` and tell me why your answers are different. – IRTFM Oct 25 '17 at 23:57
  • This is easy. My answer is different because I considered only the columns using the `target_vars = c("Ozone","Temp","Solar.R")` as defined in the question. Take a look at `sum( aq[ aq$Month==5, target_vars ], na.rm=TRUE) [1] 7541` – Roman Oct 26 '17 at 07:46
2

You can try:

library(tidyverse)
airquality %>% 
  select(Month, target_vars) %>% 
  gather(key, value, -Month) %>% 
  group_by(Month) %>%
  summarise(n=length(unique(key)),
            Sum=sum(value, na.rm = T)) %>% 
  mutate(Average=Sum/n)
# A tibble: 5 x 4
  Month     n   Sum  Average
  <int> <int> <int>    <dbl>
1     5     3  7541 2513.667
2     6     3  8343 2781.000
3     7     3 10849 3616.333
4     8     3  8974 2991.333
5     9     3  8242 2747.333

The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.

Roman
  • 17,008
  • 3
  • 36
  • 49
  • Looks possibly interesting but check my answer and explain why out numbers disagree. I think my answer is more clearly correct, at least in a numeric sense if not being dplyr-isch. – IRTFM Oct 23 '17 at 15:42
  • Can you explain the `gather` line? I haven't really used `gather` or `spread` so I'm not familiar with their parameters. – vashts85 Oct 23 '17 at 15:56
  • Have a look [here](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) or [here](http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/). There it is much better explained al I could ever do. – Roman Oct 24 '17 at 07:17
1

Perhaps this is what you're looking for

library(dplyr)
library(purrr)
library(tidyr)   # forgot this in original post
airquality %>%
  group_by(Month) %>% 
  nest(Ozone, Temp, Solar.R, .key=newcol) %>%
  mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))

# A tibble: 5 x 2
  # Month   newcol
  # <int>    <dbl>
# 1     5 243.2581
# 2     6 278.1000
# 3     7 349.9677
# 4     8 289.4839
# 5     9 274.7333  

I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month

airquality %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Solar.R) %>%
  mutate(newcol = rowSums(., na.rm=TRUE)) %>%
  summarise(sum5 = sum(newcol), mean5 = mean(newcol))

#   sum5    mean5
# 1 7541 243.2581
CPak
  • 13,260
  • 3
  • 30
  • 48