get the mean of a variable subset of data in R

Question

Imagine I have the following data:

    Year   Month  State  ppo 
    2011   Jan     CA    220 
    2011   Feb     CA    250
    2012   Jan     CA    230 
    2011   Jan     WA    200 
    2011   Feb     WA    210

I need to calculate the mean for each state for the year, so the output would look something like this:

    Year   Month  State  ppo  annualAvg
    2011   Jan     CA    220    230
    2011   Feb     CA    240    230
    2012   Jan     CA    260    260
    2011   Jan     WA    200    205
    2011   Feb     WA    210    205

where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.

Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?

Thanks for the help!

*"I have tried so many variations of it that I won't bother to post them all here"* -- just post your attempt(s) that was closest to your desired result. — nrussell, May 27 '16 at 16:04
I don't think you tried enough variations if you didn't come across something like `with(DF, ave(ppo, Year, State))` — Frank, May 27 '16 at 16:10
Try `library(dplyr); group_by(df, Year, State) %>% mutate(avg = mean(ppo)) `. — coffeinjunky, May 27 '16 at 16:13
See also the various other methods described here: http://www.brodieg.com/?p=7 — coffeinjunky, May 27 '16 at 16:16
[This thread](http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame) shows several ways for mean grouped by a single variable. I can't find an SO thread showing how to get mean grouped by more than one vbl. If it doesn't exist, might be good to convert some of these comments to answers? — dww, May 27 '16 at 16:23
@dww In my opinion, the OP's case is not related to that one. The OP wants to add a column, not to make an aggregated table, seems like. Closer is http://stackoverflow.com/questions/6053620/calculate-group-means-and-assign-to-variable-of-the-same-length-as-original-data and I think the extension to multiple grouping cols there is obvious in most cases if one reads the docs for the related functions. — Frank, May 27 '16 at 16:27
@dww It's quite simple to calculate the mean grouped by more than one column in `data.table`. Please take a look at the answer I've provided below. You just need to specify the columns in `by= .(col1, col2, ...)` — rafa.pereira, May 27 '16 at 19:57
Thanks @RafaelPereira. Of course, I understand that this is simple to do. My point was, that if this is not already answered on SO, then we should provide some answers here showing the various ways. But, if this is already adequately answered elsewhere, it should be closed as a duplicate. There are already threads that are similar to this one (see earlier comments), the only small difference being the need to group by more than one variable. — dww, May 27 '16 at 23:15

rafa.pereira · Accepted Answer · 2016-05-27T17:01:38.380

1

Try this:

library(data.table)


setDT(df) 

df[ , annualAvg := mean(ppo) , by =.(Year, State) ]

edited May 27 '16 at 17:01

answered May 27 '16 at 16:35

rafa.pereira

13,251
6
71
109

That does get me the average, but it changes the rest of the table, which I don't want. If I run the above I get a new table which has one entry for each state/month/year. What I want is the same table (which will have multiple entries for Feb. 2011 in California) where each entry also has a column with the annual avg (so each entry for Feb 2011 in California would have the same annual avg value) – Amie May 27 '16 at 16:49
Ok. I understand now that you're not summarizing the data but just creating a new column, I've corrected the code – rafa.pereira May 27 '16 at 17:03

Psidom · Answer 2 · 2017-06-17T15:24:44.860

Using dplyr with group_by %>% mutate to add a column:

library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))

#Source: local data frame [5 x 5]
#Groups: Year, State [3]

#   Year  Month  State   ppo annualAvg
#  (int) (fctr) (fctr) (int)     (dbl)
#1  2011    Jan     CA   220       235
#2  2011    Feb     CA   250       235
#3  2012    Jan     CA   230       230
#4  2011    Jan     WA   200       205
#5  2011    Feb     WA   210       205

Using data.table:

library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]

df
#   Year Month State ppo annualAvg
#1: 2011   Jan    CA 220       235
#2: 2011   Feb    CA 250       235
#3: 2012   Jan    CA 230       230
#4: 2011   Jan    WA 200       205
#5: 2011   Feb    WA 210       205

Data:

structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L, 
1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"), 
    State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA", 
    "WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L, 
    210L), annualAvg = c(235, 235, 230, 205, 205)), .Names = c("Year", 
"Month", "State", "ppo", "annualAvg"), class = c("data.table", 
"data.frame"), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x105000778>)

score 0 · Answer 3 · answered May 27 '16 at 17:29

0

Base R: df$ppoAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)

answered May 27 '16 at 17:29

JasonAizkalns

20,243
8
57
116

get the mean of a variable subset of data in R

3 Answers3