Finding average for every 3 columns in dataframe in R

Question

I want to find an average of every 3 columns in data frame which contains 60 columns so that a new data frame would have 20 columns or so. I am putting sample data as below: Also, the new variable, if I can pass them as an array of string, would help.

structure(list(`1961` = c(0, 0, 0, 0, 0, 0, 0, 0, 4.633, 54.247, 
0, 0, 0, 0, 0, 0, 0, 31.036, 3.18, 19.862), `1962` = c(0, 0, 
0, 0, 0, 0, 0, 0, 4.168, 63.587, 0, 0, 0, 0, 0, 0, 0, 28.169, 
2.913, 17.273), `1963` = c(0, 0, 0, 0, 0, 0, 0, 0, 3.284, 56.888, 
0, 0, 0, 0, 0, 0, 0, 26.667, 2.653, 16.586), `1964` = c(0, 0, 
0, 0, 0, 0, 0, 0, 2.689, 48.722, 0, 0, 0, 0, 0, 0, 0, 25.483, 
3.873, 15.708), `1965` = c(0, 0, 0, 0, 0, 0, 0, 0, 3.304, 33.838, 
0, 0, 0, 0, 0, 0, 0, 28.164, 3.927, 11.147), `1966` = c(0, 0, 
0, 0, 0, 0, 0, 0, 2.871, 26.695, 0, 0, 0, 0, 0, 0, 0, 28.962, 
4.434, 14.056), `1967` = c(0, 0, 0, 0, 0, 0, 0, 0, 2.752, 36.246, 
0, 0, 0, 0, 0, 0, 0, 30.877, 4.739, 14.765), `1968` = c(0, 0, 
0, 0, 0, 0, 0, 0, 3.537, 33.368, 0, 0, 0, 0, 0, 0, 0, 25.628, 
5.445, 14.372), `1969` = c(0, 0, 0, 0, 0, 0, 0, 0, 2.484, 35.711, 
0, 0, 0, 0, 0, 0, 0, 27.123, 5.286, 15.527)), row.names = c("Almonds, with shell", 
"Anise, badian, fennel, coriander", "Apples", "Apricots", "Areca nuts", 
"Asparagus", "Avocados", "Bananas", "Barley", "Bastfibres, other", 
"Beans, dry", "Beans, green", "Berries nes", "Broad beans, horse beans, dry", 
"Buckwheat", "Cabbages and other brassicas", "Carrots and turnips", 
"Cashew nuts, with shell", "Cassava", "Castor oil seed"), class = "data.frame")

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

Alternatively, this can be solved using base R:

sapply(seq(2, ncol(mydf), 3), function(j) rowMeans(mydf[, j+(-1:1)]))

or

sapply(seq(1, ncol(mydf), 3), function(j) rowMeans(mydf[, j+(0:2)]))

                                      [,1]      [,2]      [,3]
Almonds, with shell               0.000000  0.000000  0.000000
Anise, badian, fennel, coriander  0.000000  0.000000  0.000000
Apples                            0.000000  0.000000  0.000000
Apricots                          0.000000  0.000000  0.000000
Areca nuts                        0.000000  0.000000  0.000000
Asparagus                         0.000000  0.000000  0.000000
Avocados                          0.000000  0.000000  0.000000
Bananas                           0.000000  0.000000  0.000000
Barley                            4.028333  2.954667  2.924333
Bastfibres, other                58.240667 36.418333 35.108333
Beans, dry                        0.000000  0.000000  0.000000
Beans, green                      0.000000  0.000000  0.000000
Berries nes                       0.000000  0.000000  0.000000
Broad beans, horse beans, dry     0.000000  0.000000  0.000000
Buckwheat                         0.000000  0.000000  0.000000
Cabbages and other brassicas      0.000000  0.000000  0.000000
Carrots and turnips               0.000000  0.000000  0.000000
Cashew nuts, with shell          28.624000 27.536333 27.876000
Cassava                           2.915333  4.078000  5.156667
Castor oil seed                  17.907000 13.637000 14.888000

The benefit is that the aggregation is solely based on position, not on column names. This is opposed to A.S.K.'s approach which requires that the column names can be coerced to a consecutive sequence of integers, i.e., a sequence of years.

However, the code above returns a matrix, not a data.frame. In addition, the OP has requested to pass the new variable as an array of string.

library(magrittr)   # piping used to improve readability
new_cols <- c("Mean_A", "Mean_B", "Mean_C")
sapply(seq(1, ncol(mydf), 3), function(j) rowMeans(mydf[, j+(0:2)])) %>% 
  as.data.frame() %>% 
  set_names(new_cols)

                                    Mean_A    Mean_B    Mean_C
Almonds, with shell               0.000000  0.000000  0.000000
Anise, badian, fennel, coriander  0.000000  0.000000  0.000000
Apples                            0.000000  0.000000  0.000000
Apricots                          0.000000  0.000000  0.000000
Areca nuts                        0.000000  0.000000  0.000000
Asparagus                         0.000000  0.000000  0.000000
Avocados                          0.000000  0.000000  0.000000
Bananas                           0.000000  0.000000  0.000000
Barley                            4.028333  2.954667  2.924333
Bastfibres, other                58.240667 36.418333 35.108333
Beans, dry                        0.000000  0.000000  0.000000
Beans, green                      0.000000  0.000000  0.000000
Berries nes                       0.000000  0.000000  0.000000
Broad beans, horse beans, dry     0.000000  0.000000  0.000000
Buckwheat                         0.000000  0.000000  0.000000
Cabbages and other brassicas      0.000000  0.000000  0.000000
Carrots and turnips               0.000000  0.000000  0.000000
Cashew nuts, with shell          28.624000 27.536333 27.876000
Cassava                           2.915333  4.078000  5.156667
Castor oil seed                  17.907000 13.637000 14.888000

BTW: Reconsider data structures

The dataset looks more like a matrix than a data.frame, i.e., all columns are of the same data type. Otherwise, it would not be possible to aggregate by means across columns. Perhaps, the data should be treated as matrix where we can benefit from matrix operations like rowMeans().

Rows and columns of a matrix can be named as well:

library(magrittr)
new_cols <- c("Mean_A", "Mean_B", "Mean_C")
sapply(seq(1, ncol(mydf), 3), function(j) rowMeans(mydf[, j+(0:2)])) %>% 
  set_colnames(new_cols)

                                    Mean_A    Mean_B    Mean_C
Almonds, with shell               0.000000  0.000000  0.000000
Anise, badian, fennel, coriander  0.000000  0.000000  0.000000
Apples                            0.000000  0.000000  0.000000
Apricots                          0.000000  0.000000  0.000000
Areca nuts                        0.000000  0.000000  0.000000
Asparagus                         0.000000  0.000000  0.000000
Avocados                          0.000000  0.000000  0.000000
Bananas                           0.000000  0.000000  0.000000
Barley                            4.028333  2.954667  2.924333
Bastfibres, other                58.240667 36.418333 35.108333
Beans, dry                        0.000000  0.000000  0.000000
Beans, green                      0.000000  0.000000  0.000000
Berries nes                       0.000000  0.000000  0.000000
Broad beans, horse beans, dry     0.000000  0.000000  0.000000
Buckwheat                         0.000000  0.000000  0.000000
Cabbages and other brassicas      0.000000  0.000000  0.000000
Carrots and turnips               0.000000  0.000000  0.000000
Cashew nuts, with shell          28.624000 27.536333 27.876000
Cassava                           2.915333  4.078000  5.156667
Castor oil seed                  17.907000 13.637000 14.888000

The printed output looks similar to the data.frame solution but the underlying data structure is now a matrix.

Alternatively, the data can be stored in long format after reshaping (which is what the call to gather() does in A.S.K.'s approach). Then, the column names become data objects and can be manipulated as such.

A. S. K. · Answer 2 · 2019-08-04T20:13:49.837

We can use the tidyverse to rotate the data from wide to long, and then group by 3-year sequences.

library(dplyr)
library(tidyr)
library(tibble)

df.averaged = df %>%
  # Extract rownames as their own column
  rownames_to_column("product") %>%
  # Convert from wide to long: one row per product per year
  gather(year, value, -product) %>%
  # Add a column with "year group" (every 3 years go into one group)
  mutate(year = as.numeric(year),
         year.group = (floor((year + 1) / 3) * 3) - 1) %>%
  # Group by product and year group
  group_by(product, year.group) %>%
  # Get averages
  summarize(value = mean(value)) %>%
  # Convert back from long to wide, if desired
  spread(year.group, value)

Edit: For a moving average, we can use the RcppRoll package (see also this answer).

library(RcppRoll)
df.moving.window = df %>%
  # Extract rownames as their own column
  rownames_to_column("product") %>%
  # Convert from wide to long: one row per product per year
  gather(year, value, -product) %>%
  # Order by product, then year
  arrange(product, year) %>%
  # Compute the rolling average
  group_by(product) %>%
  mutate(value = roll_mean(value, n = 3, align = "right", fill = NA)) %>%
  # Convert back from long to wide, if desired
  spread(year, value)

Thanks! can you explain this `(floor((year + 1) / 3) * 3) - 1)`. I couldn't follow. — Ambrish Dhaka, Aug 03 '19 at 07:02
To map each year to the first year in its group of 3. `floor(year / 3) * 3` would convert 0, 1, and 2 to 0; 3, 4, and 5 to 3; and so on. The `+ 1` and `- 1` just move the window a little bit so that 1961-3 go into the same group (otherwise it would be 1959-61, 1962-4, etc.). — A. S. K., Aug 03 '19 at 07:14
thanks that is great, is there anything like moving average for each column taken for 3 years. — Ambrish Dhaka, Aug 03 '19 at 07:34

score 1 · Accepted Answer · answered Aug 03 '19 at 11:36

Here's another way that is somewhat robust:

n <- 3
i <- seq(1, length(DF), n)

DF2 <- data.frame(nut = rownames(DF))
DF2[, paste0('NewCol', seq_along(i))] <- lapply(i, function (j) rowMeans(DF[, j:min(j+2, length(DF))]))

DF2

                                nut   NewCol1   NewCol2   NewCol3
1               Almonds, with shell  0.000000  0.000000  0.000000
2  Anise, badian, fennel, coriander  0.000000  0.000000  0.000000
3                            Apples  0.000000  0.000000  0.000000
4                          Apricots  0.000000  0.000000  0.000000
5                        Areca nuts  0.000000  0.000000  0.000000
6                         Asparagus  0.000000  0.000000  0.000000
7                          Avocados  0.000000  0.000000  0.000000
8                           Bananas  0.000000  0.000000  0.000000
9                            Barley  4.028333  2.954667  2.924333
10                Bastfibres, other 58.240667 36.418333 35.108333
11                       Beans, dry  0.000000  0.000000  0.000000
12                     Beans, green  0.000000  0.000000  0.000000
13                      Berries nes  0.000000  0.000000  0.000000
14    Broad beans, horse beans, dry  0.000000  0.000000  0.000000
15                        Buckwheat  0.000000  0.000000  0.000000
16     Cabbages and other brassicas  0.000000  0.000000  0.000000
17              Carrots and turnips  0.000000  0.000000  0.000000
18          Cashew nuts, with shell 28.624000 27.536333 27.876000
19                          Cassava  2.915333  4.078000  5.156667
20                  Castor oil seed 17.907000 13.637000 14.888000

Some things to point out are that the output is a data.frame. The lapply() function returns a list. Those lists are then assigned back to new columns in DF2.

The most important thing is the j:min(j+2, length(DF)). This part will allow the code whether there are 2 columns or 3 columns.

score 1 · Answer 4 · answered Aug 03 '19 at 14:24

Interesting problem! Made another take using purrr and hablar. Create a list of integers for every 3 columns. Apply row means on every 3 columns and combine into new df. Copy rownames from original df.

Code

library(tidyverse)
library(hablar)
library(magrittr)

l <- map(seq(1, ncol(df), 3), ~seq(.x, .x + 2))

map_dfc(l, ~df %>% transmute(mean = row_mean_(.x))) %>% 
  set_rownames(rownames(df))

Result

                                      mean     mean1     mean2
Almonds, with shell               0.000000  0.000000  0.000000
Anise, badian, fennel, coriander  0.000000  0.000000  0.000000
Apples                            0.000000  0.000000  0.000000
Apricots                          0.000000  0.000000  0.000000
Areca nuts                        0.000000  0.000000  0.000000
Asparagus                         0.000000  0.000000  0.000000
Avocados                          0.000000  0.000000  0.000000
Bananas                           0.000000  0.000000  0.000000
Barley                            4.028333  2.954667  2.924333
Bastfibres, other                58.240667 36.418333 35.108333
Beans, dry                        0.000000  0.000000  0.000000
Beans, green                      0.000000  0.000000  0.000000
Berries nes                       0.000000  0.000000  0.000000
Broad beans, horse beans, dry     0.000000  0.000000  0.000000
Buckwheat                         0.000000  0.000000  0.000000
Cabbages and other brassicas      0.000000  0.000000  0.000000
Carrots and turnips               0.000000  0.000000  0.000000
Cashew nuts, with shell          28.624000 27.536333 27.876000
Cassava                           2.915333  4.078000  5.156667
Castor oil seed                  17.907000 13.637000 14.888000

Finding average for every 3 columns in dataframe in R

4 Answers4

BTW: Reconsider data structures