Using aggregate() function in R for multiple columns

Question

I have the following data:

dat <- structure(list(Year = c(1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 
1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1979L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 1980L, 
1980L, 1980L, 1980L, 1980L, 1980L), Month = c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), Day = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 
28L, 29L), Rainfall = c(1, 35.5, 20.3, 2.5, 32, 66.8, 0, 0, 1.8, 
0, 5.3, 0, 0, 0, 11.7, 40.4, 45.7, 15.3, 21.6, 10.5, 26.2, 54.1, 
1.5, 26.9, 39.4, 21.6, 1.3, 95.6, 10.2, 0, 5.1, 0, 4.1, 2.9, 
0, 0.5, 2.1, 15.7, 14.2, 28.7, 134.2, 26.3, 0, 0, 0, 2.3, 0, 
2.8, 0.3, 0.8, 0, 0, 1.8, 0, 0, 0.8, 0, 0, 3.3, 9.1, 31.5, 24.6, 
18.5, 0, 37.1, 111.3, 4.3, 21.1, 3.1, 0, 0, 0, 15.8, 30.7, 6.4, 
68.6, 97.5, 64.3, 47.3, 0, 2.3, 8.7, 53.9, 6.9, 20.9, 94, 7.4, 
0, 1.3, 0, 10.9, 0, 81.8, 10.2, 2.5, 1, 3.1, 12.3, 41.9, 85.9, 
74.4, 13.8, 79.9, 117.7, 16.5, 31.2, 1.3, 1.3, 4.6, 7.1, 0, 0, 
0, 2, 4.8, 2, 3.8, 94.2, 1.8), test = c(1, 35.5, 20.3, 2.5, 32, 
66.8, 0, 0, 1.8, 0, 5.3, 0, 0, 0, 11.7, 40.4, 45.7, 15.3, 21.6, 
10.5, 26.2, 54.1, 1.5, 26.9, 39.4, 21.6, 1.3, 95.6, 10.2, 0, 
5.1, 0, 4.1, 2.9, 0, 0.5, 2.1, 15.7, 14.2, 28.7, 134.2, 26.3, 
0, 0, 0, 2.3, 0, 2.8, 0.3, 0.8, 0, 0, 1.8, 0, 0, 0.8, 0, 0, 3.3, 
9.1, 31.5, 24.6, 18.5, 0, 37.1, 111.3, 4.3, 21.1, 3.1, 0, 0, 
0, 15.8, 30.7, 6.4, 68.6, 97.5, 64.3, 47.3, 0, 2.3, 8.7, 53.9, 
6.9, 20.9, 94, 7.4, 0, 1.3, 0, 10.9, 0, 81.8, 10.2, 2.5, 1, 3.1, 
12.3, 41.9, 85.9, 74.4, 13.8, 79.9, 117.7, 16.5, 31.2, 1.3, 1.3, 
4.6, 7.1, 0, 0, 0, 2, 4.8, 2, 3.8, 94.2, 1.8)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
55L, 56L, 57L, 58L, 59L, 366L, 367L, 368L, 369L, 370L, 371L, 
372L, 373L, 374L, 375L, 376L, 377L, 378L, 379L, 380L, 381L, 382L, 
383L, 384L, 385L, 386L, 387L, 388L, 389L, 390L, 391L, 392L, 393L, 
394L, 395L, 396L, 397L, 398L, 399L, 400L, 401L, 402L, 403L, 404L, 
405L, 406L, 407L, 408L, 409L, 410L, 411L, 412L, 413L, 414L, 415L, 
416L, 417L, 418L, 419L, 420L, 421L, 422L, 423L, 424L, 425L), class = "data.frame")

This is a sample data of daily rainfall from 1979 to 1980. There are four columns: Year, Month, Day, Rainfall, and test.

I want to get the:

(1) Mean annual total per column. In this example, the average of the two years. 

So, get the annual total first then get the average across the years. 

I will be applying this for data with 60 columns from 1979 to 2017.

I don't know how to do this with aggregate() in R. Im using the following commands but this is only for one column:

dat2<-aggregate(Rainfall~Year,dat,FUN=sum,na.rm=T, na.action=NULL)
mean(dat2$Rainfall)

dat3<-aggregate(test~Year,dat,FUN=sum,na.rm=T, na.action=NULL)
mean(dat3$test)

Is there an easy way to do this in R? Right now I'm doing it manually per column.

I'll appreciate any help on this!

You need to calculate mean of rainfall and test for each year, something like this: dat %>% group_by(Year) %>% summarise(Avg_Rainfall = mean(Rainfall), Avg_test = mean(test)) — Karthik S, Oct 08 '20 at 06:29
`aggregate` can take multiple variables at a time on the left hand side - `aggregate(cbind(Rainfall,test) ~ Year, data=dat, FUN=sum)` — thelatemail, Oct 08 '20 at 06:37
@KarthikS I need to get the sum first then the average. Not the annual mean — Lyndz, Oct 08 '20 at 06:39

Ronak Shah · Answer 1 · 2020-10-08T06:40:44.550

1

In aggregate you can pass multiple variables with cbind and to get the mean use colMeans :

dat3<-aggregate(cbind(Rainfall, test)~Year,dat,FUN=sum,na.rm=T, na.action=NULL)
colMeans(dat3[-1])

However, if there are many columns it is better to get data in long format and then summarise them :

library(dplyr)

dat %>%
  tidyr::pivot_longer(cols = c(Rainfall, test)) %>%
  group_by(Year, name) %>%
  summarise(sum = sum(value, na.rm = TRUE)) %>%
  summarise(mean = mean(sum))

edited Oct 08 '20 at 06:40

answered Oct 08 '20 at 06:39

Ronak Shah

377,200
20
156
213

the file has 60 columns...Year Month Day, Rainfall, test.....until the 60th. Thanks for this! – Lyndz Oct 08 '20 at 06:39
the one using the dplyr, it doesn't get the annual accumulated first? – Lyndz Oct 08 '20 at 06:41
1

Corrected the `dplyr` method. How can you uniquely identify those 60 columns. In `cols` you can select columns by position, `1:60` or by pattern in their name. `col1`, `col2` etc. – Ronak Shah Oct 08 '20 at 06:42
1

@Lyndz you can use something like `dat[4:ncol(dat)]` to select all you columns without using explicit names – Paul Oct 08 '20 at 06:43

Paul · Accepted Answer · 2020-10-08T09:15:39.493

1

An alternative way using only aggregate() and/or colMeans(), showing how to select many columns without having to name them:

# Sum    
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
# Mean by year
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
# Mean without grouping per year
colMeans(dat[4:ncol(dat)])

The output:

> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall   test
1 1979    833.1  833.1
2 1980   1492.6 1492.6
> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
  Year Rainfall     test
1 1979 14.12034 14.12034
2 1980 24.87667 24.87667
> colMeans(dat[4:ncol(dat)])
Rainfall     test 
 19.5437  19.5437

EDIT: in case they are some non-numeric columns in the mix that should stay as non-numeric, you can replace dat[4:ncol(dat)] by Filter(is.numeric, dat[4:ncol(dat)], see an example below:

> dat$test <- as.character(dat$test)
> aggregate(Filter(is.numeric, dat[4:ncol(dat)]), by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall
1 1979    833.1
2 1980   1492.6

edited Oct 08 '20 at 09:15

answered Oct 08 '20 at 06:49

Paul

2,850
1
12
37

I tried this but I am getting the following error: Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument. I think the first row (column headers) are included in the sum()? How can I remove this? – Lyndz Oct 08 '20 at 08:37
Or maybe is this because of the NA value? – Lyndz Oct 08 '20 at 08:43
1

@Lyndz, with the sample `dat` I was not able to get this error. My guess is that not all your columns are of type numeric, maybe, for ex, your column 45 is of type character. – Paul Oct 08 '20 at 08:47
1

See this test: `> dat$test <- as.character(dat$test) > aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL) Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument` – Paul Oct 08 '20 at 08:51
here's the link to the data that I'm applying this: https://www.dropbox.com/s/80s37e1s9awopqu/data_station_daily.csv?dl=0 – Lyndz Oct 08 '20 at 08:52
I read the csv like this `dat<-read.csv("data_station_daily.csv",header=T,sep=",")` – Lyndz Oct 08 '20 at 08:54
Your column `MAASIN` is of type character – Paul Oct 08 '20 at 08:57
OMG! I found it..there's some texts there – Lyndz Oct 08 '20 at 09:01
1

It should be possible to add something to only aggregate and sum with numeric columns, but sometimes it is useful to have these messages because maybe this column is expected to be numeric but the import screwed it up. – Paul Oct 08 '20 at 09:04
Yes..Thanks! haha. – Lyndz Oct 08 '20 at 09:07
But if you replace `dat[4:ncol(dat)]` by `Filter(is.numeric, dat[4:ncol(dat)])` it should work – Paul Oct 08 '20 at 09:10
I already replaced the characters in that column. Your code is working now. :-) – Lyndz Oct 08 '20 at 09:11

Using aggregate() function in R for multiple columns

2 Answers2