How to summarize based on multiple columns in R?

Question

I want to summarize the dataset based on "year", "months", and "subdist_id" columns. For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years. For example, for subdist_id 81, the mean Rainfall value of 2004 will be the mean Rainfall of months 11, 12 of 2004, and months 1,2 of 2005.

I am getting no clue how to do it, although I searched online rigorously.

Can you upload reproducible data? You can run `dput("name_of_the_dataset")` and copy the output as a code to your question. — Bloxx, Apr 03 '22 at 23:11
It's unclear what you're asking or what end result you want—a [reproducible example](https://stackoverflow.com/q/5963269/5325862) including what you've tried that didn't work would help, but whatever the issue is has likely been covered before. What you're describing seems like a rolling mean — camille, Apr 04 '22 at 01:28

score 2 · Accepted Answer · answered Apr 04 '22 at 00:20

Expanding on @Bloxx's answer and incorporating my comment:

# Set up example data frame:
df = data.frame(year=c(rep.int(2004,2),rep.int(2005,4)), 
                month=((0:5%%4)-2)%%12+1,
                Rainfall=seq(.5,by=0.15,length.out=6))

Now use mutate to create year2 variable:

df %>% mutate(year2 = year - (month<3)*1) # or similar depending on the problem specs

And now apply the groupby/summarise action:

df %>% mutate(year2 = year - (month<3)*1) %>% 
       group_by(year2) %>% 
       summarise(Rainfall = mean(Rainfall))

score 0 · Answer 2 · answered Apr 03 '22 at 23:21

0

Lets assume your dataset is called df. Is this what you are looking for?

df %>% group_by(subdist_id, year) %>% summarise(Rainfall = mean(Rainfall))

answered Apr 03 '22 at 23:21

Bloxx

1,495
1
9
21

sorry for the confusion! The mean rainfall will depend on months that overlap in two years. For example, 2004 mean rainfall will be average of months 11,12 of 2004 and 1,2 of 2005. – Evan Apr 03 '22 at 23:32
What do you mean by overlap? – Bloxx Apr 03 '22 at 23:36
X year's mean rainfall should be the average of 11th and 12th months' rainfall of X year and 1st and 2nd months' rainfall of X+1 year. – Evan Apr 04 '22 at 00:02
Looks like you need to set up a new variable, say `year2` that is equal to the value of `year`-1 when `month` is 1 or 2 and then apply Bloxx's groupby/summarise operation. I often set up similar "water year" variables based on calendar year and month values. – njp Apr 04 '22 at 00:07
@njp that's the correct way I was looking for – Evan Apr 04 '22 at 04:38

score 0 · Answer 3 · answered Apr 04 '22 at 01:25

0

I think you can simply do this:

df %>% filter(months %in% c(1,2,11,12)) %>%
  group_by(subdist_id, year=if_else(months %in% c(1,2),year-1,year)) %>% 
  summarize(meanRain = mean(Rainfall))

Output:

  subdist_id  year meanRain
       <dbl> <dbl>    <dbl>
1         81  2004    0.611
2         81  2005    0.228

Input:

df = data.frame(
  subdist_id = 81,
  year=c(2004,2004, 2005, 2005, 2005, 2005),
  months=c(11,12,1,2,11,12),
  Rainfall = c(.251,.333,.731,1.13,.111,.346)
)

answered Apr 04 '22 at 01:25

langtang

22,248
1
12
27

This assumes that they'll only ever be interested in those 4 specific months, which seems unlikely – camille Apr 04 '22 at 01:30
true, but that's what the OP specifically stated. As in "For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years." – langtang Apr 04 '22 at 01:32

How to summarize based on multiple columns in R?

3 Answers3