Average two variables based on two columns

Question

I am new to R and have just discovered the aggregate function. However, I don't know whether it can group by two variables instead of one.

I have a data frame, approval_agg, with four columns: month_year (the month), subgroup (a demographic), and approve_estimate and disapprove_estimate (the approval and disapproval ratings, respectively).

I would like to get the average ratings for each combination of month and subgroup. Here is some example data:

month_year      subgroup     approve_estimate    disapprove_estimate
2020-11-01      Voters        53                 47               
2020-11-01      All polls     56                 44
2020-11-01      Adults        54                 46
2020-11-01      Voters        54                 46               
2020-11-01      All polls     53                 47
2020-11-01      Adults        49                 51
2020-10-01      Voters        57                 43
2020-10-01      All polls     56                 44
2020-10-01      Adults        60                 40
2020-10-01      Voters        51                 49
2020-10-01      All polls     57                 43
2020-10-01      Adults        53                 47

From this I would like to get:

2020-11-01      Voters        53.5               46.5               
2020-11-01      All polls     54.5               45.5
2020-11-01      Adults        51.5               48.5
2020-10-01      Voters        54                 46               
2020-10-01      All polls     56.5               43.5
2020-10-01      Adults        56.5               43.5

My aggregate call, which groups by only one column, is aggregate(. ~ month_year, df, mean), but it returns NA values. Is there a way to use aggregate, or anything else, to get these means?

score 1 · Accepted Answer · answered Dec 15 '20 at 23:12

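We can use summarise with across:

library(dplyr)
df1 %>%
    # one group per month/subgroup combination
    group_by(month_year, subgroup) %>% 
    # average every column whose name ends in 'estimate'; the lambda form
    # replaces passing na.rm through across(), which dplyr 1.1.0 deprecated
    summarise(across(ends_with('estimate'), ~ mean(.x, na.rm = TRUE)), .groups = 'drop')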

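With aggregate, put both grouping columns on the right-hand side of the formula. If there are NA values, pass na.rm = TRUE to mean and set na.action = NULL so that aggregate does not drop rows containing an NA before the means are computed: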

aggregate(. ~ month_year + subgroup, df1, mean, na.rm = TRUE, na.action = NULL)
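As a minimal sketch of the same idea on the question's example data (rebuilt here by hand, with the name df1 assumed as above), the columns to average can also be listed explicitly with cbind(). The NA values in the question most likely arise because `. ~ month_year` leaves the non-numeric subgroup column on the left-hand side, where mean cannot average it.

```r
# Example data copied from the question
df1 <- data.frame(
  month_year = rep(c("2020-11-01", "2020-10-01"), each = 6),
  subgroup = rep(c("Voters", "All polls", "Adults"), times = 4),
  approve_estimate = c(53, 56, 54, 54, 53, 49, 57, 56, 60, 51, 57, 53),
  disapprove_estimate = c(47, 44, 46, 46, 47, 51, 43, 44, 40, 49, 43, 47)
)

# cbind() names the columns to average; both grouping columns go after ~
res <- aggregate(cbind(approve_estimate, disapprove_estimate) ~ month_year + subgroup,
                 data = df1, FUN = mean)
res
```

This returns one row per month and subgroup, with only the two rating columns averaged.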

score 0 · Answer 2 · answered Dec 16 '20 at 00:02

Here is a solution using data.table.

Assume that df is the example data.frame:

library(data.table)
dt <- data.table(df)
# := adds a column by reference; by= computes the mean within each group
dt[, approve_estimate_mean := mean(approve_estimate), by = list(month_year, subgroup)]
dt[, disapprove_estimate_mean := mean(disapprove_estimate), by = list(month_year, subgroup)]
df <- as.data.frame(dt)

The difference is that this adds the group means as new columns, repeated on every row of the group, instead of reducing each group to a single row.

The result is:

   month_year  subgroup approve_estimate disapprove_estimate approve_estimate_mean disapprove_estimate_mean
1  2020-11-01    Voters               53                  47                  53.5                     46.5
2  2020-11-01 All polls               56                  44                  54.5                     45.5
3  2020-11-01    Adults               54                  46                  51.5                     48.5
4  2020-11-01    Voters               54                  46                  53.5                     46.5
5  2020-11-01 All polls               53                  47                  54.5                     45.5
6  2020-11-01    Adults               49                  51                  51.5                     48.5
7  2020-10-01    Voters               57                  43                  54.0                     46.0
8  2020-10-01 All polls               56                  44                  56.5                     43.5
9  2020-10-01    Adults               60                  40                  56.5                     43.5
10 2020-10-01    Voters               51                  49                  54.0                     46.0
11 2020-10-01 All polls               57                  43                  56.5                     43.5
12 2020-10-01    Adults               53                  47                  56.5                     43.5
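If one row per group is wanted instead, data.table can also collapse the groups. The following is a sketch, not part of the original answer, that rebuilds the example data by hand and uses .SD with .SDcols to average both rating columns in one call:

```r
library(data.table)

# Example data copied from the question
dt <- data.table(
  month_year = rep(c("2020-11-01", "2020-10-01"), each = 6),
  subgroup = rep(c("Voters", "All polls", "Adults"), times = 4),
  approve_estimate = c(53, 56, 54, 54, 53, 49, 57, 56, 60, 51, 57, 53),
  disapprove_estimate = c(47, 44, 46, 46, 47, 51, 43, 44, 40, 49, 43, 47)
)

# .SD holds the columns named in .SDcols for each group;
# lapply() applies mean to each of them, giving one row per group
res <- dt[, lapply(.SD, mean),
          by = .(month_year, subgroup),
          .SDcols = c("approve_estimate", "disapprove_estimate")]
res
```

Without :=, the expression returns a new summarised table and leaves dt unchanged.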
