Means of blocks of n columns in a data.frame

Question

I have been searching around and trying multiple different approaches to average every 10th column in a data.frame. The data set is 52 rows x 60 columns. The data.frame, titled data, looks like this for the first 2 rows:

X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876

(The 4 and 5 as well as the top row are just placeholders in the file.

The data is being read and pulled from a .txt file and I want to average every 10th column to change it from 60 columns into 6. Here is some extra information that I have seen people ask for previously:

> class(data)
[1] "data.frame"

> str(data)
'data.frame':   52 obs. of  60 variables:
$ X1 : Factor w/ 53 levels "0","0.0319994",..: 31 32 34 30 51 48 45 39 36 28 ...
$ X2 : Factor w/ 48 levels "0","0.0319994",..: 27 30 29 26 46 42 39 31 23 19 ...

Most recently I have tried:

dataMean <- data.frame(Means=rowMeans(data), ncol=10)

and

dataMean <- rowMeans(data.frame(data, ncol=10))

and both give the same error about 'x' must be numeric.Any help that someone could provide will be appreciated.

Thanks in Advance!

Edit: The desired results would be something like this where the number of columns has be reduced and the arithmetic mean is calculated for every 10 columns:

X1 X2 X3 X4 X5 X6
4 14.4145   13.6921 15.7813 16.3909 18.12123    17.86484
5 16.97887  16.74208    17.72446    17.97403    19.78841    19.382

Edit2:

 > dput(df)
 structure(list(X1X2X3X4X5X6X7X8X9X10X11X12X13X14X15X16X17X18X19X20X21X22X23X24X25X26X27X28X29X30X31X32X33X34X35X36X37X38X39X40X41X42X43X44X45X46X47X48X49X50X51X52X53X54X55X56X57X58X59X60 = c("414.763714.211714.123713.663712.983713.323713.887715.099715.571716.515715.059713.531713.695713.263713.511713.423714.127713.843712.835713.627713.207714.983716.127715.619715.751716.855715.975715.967716.167717.155716.115716.355716.203716.807716.675716.483716.787716.103716.311716.063716.107716.247717.191718.123618.503618.295620.951618.063618.551619.175619.599619.203618.199616.711716.703716.787716.583717.663618.859618.3356", 

 ="516.959715.903715.391715.679715.679715.839717.151718.079618.623620.479618.879616.287716.799715.615716.991716.831716.991717.535616.351715.135716.543717.407718.431617.055717.311719.167618.239616.703717.215719.167618.207616.767718.719619.423618.271617.535618.719617.887617.247716.959717.279718.399619.551619.263620.095620.447621.535618.431620.735622.143621.663620.767619.743618.559617.951617.887618.111619.295620.351619.4876"
)), class = "data.frame", row.names = c(NA, -2L))

I think this is a duplicate. You select the nth column with something like c(rep(FALSE,n-1),TRUE) in the j-position using "[". And the recycling rules apply so it should get repeated to the entire length of a dataframe. — IRTFM, Jun 18 '18 at 23:04
https://stackoverflow.com/questions/40419203/r-rationale-for-recycling-boolean-indices-for-selection — IRTFM, Jun 18 '18 at 23:09
Maybe better - https://stackoverflow.com/questions/24440258/selecting-multiple-odd-or-even-columns-rows-for-dataframe — thelatemail, Jun 18 '18 at 23:11
I wouldn't consider that to be a duplicate, but maybe that's because I'm not sure what the OP is asking. Could you give us a clear example of the behavior you're looking for? What do you mean by "average every 10th column"? — De Novo, Jun 18 '18 at 23:11
However, the lack of a correct result in the example prevents be from knowing if you meant to say that you wanted the means of columns grouped ten at a time. — IRTFM, Jun 18 '18 at 23:11
@42- sorry I'm new to programming, what does c(rep do and what do you mean by the j-position? — TCB at UGA, Jun 18 '18 at 23:12
@DanHall I want to take every 10 columns and calculate the mean of them to reduce the number of columns in the data.frame from 60 to 6 if that makes sense — TCB at UGA, Jun 18 '18 at 23:13
@TCBatUGA edit your question to give us an example of what that would look like (the result) for your example data. That's the best way to ensure we know what behavior you're looking for. — De Novo, Jun 18 '18 at 23:14
@DanHall I have edited the question with how I am hoping I can get the data to look. Thanks for the patience. — TCB at UGA, Jun 18 '18 at 23:27
This dput is really weird, and I can't even execute it, are you sure it was displayed as such in your console ? — moodymudskipper, Jun 19 '18 at 02:14
Or rather this is the dput when I tried to get the other 50 rows of code to run in the code. — TCB at UGA, Jun 19 '18 at 02:37
@Moody_Mudskipper The `dput(data)` for the full data set is too big to post (over 50,000 characters). If there is a portion of it that you need to see I can post that. Also when I run your code I am able to get the values that you have, it is just when I try adding in the other 50 rows that the problem occurs. I get the error x must be numeric when I try substituting `df` for `data` in your code and when I try pasting the full data set into `df` I get an error saying that the lines do not have 61 elements. — TCB at UGA, Jun 19 '18 at 13:51
@Moody_Mudskipper I was able to solve the problem. The code I used is posted in the answers below. Thanks for all the help — TCB at UGA, Jun 19 '18 at 18:48

score 4 · Answer 1 · answered Jun 18 '18 at 23:18

we can use split and rowMeans:

as.data.frame(sapply(
  split(seq_along(df),(seq_along(df)-1) %/%10),
  function(x) rowMeans(df[x])
))
#          0        1        2        3        4        5
# 4 14.41450 13.69210 15.78130 16.39090 18.12123 17.86484
# 5 16.97887 16.74208 17.72446 17.97403 19.78841 19.38200

data

df <- read.table(header=TRUE,stringsAsFactors=FALSE,text="X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
           5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876")

I still received the same error of "Error in rowMeans(df[x]) : 'x' must be numeric" — TCB at UGA, Jun 18 '18 at 23:55
start with `df[] <- lapply(df,function(x) as.numeric(as.character(x)))` — moodymudskipper, Jun 19 '18 at 00:15
but better to take care of the issue upstream, you have doubles stored as factors, it means someone messed up an importation ot reformatting step somewhere — moodymudskipper, Jun 19 '18 at 00:16
The error was still the same, so do you think this is a problem with the code before this point? I did not write that portion of the code and the person that did warned me that the code was not done well but could do what they were asking. Should I try to reformat this data into a numerical matrix or something to try to get around this? — TCB at UGA, Jun 19 '18 at 00:42
please add the output of `dput(df)` to the question, that'll be much easier — moodymudskipper, Jun 19 '18 at 00:45

Maurits Evers · Answer 2 · 2018-06-19T03:30:51.933

Here is a tidyverse possibility

library(tidyverse)
df %>%
    rowid_to_column("row") %>%
    gather(k, v, -row) %>%
    mutate(group = (as.numeric(sub("X", "", k)) - 1) %/% 10) %>%
    group_by(group, row) %>%
    summarise(v.mean = mean(v)) %>%
    spread(group, v.mean) %>%
    select(-row)
## A tibble: 2 x 6
#    `0`   `1`   `2`   `3`   `4`   `5`
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1  14.4  13.7  15.8  16.4  18.1  17.9
#2  17.0  16.7  17.7  18.0  19.8  19.4

Update

The same works if you have more than 2 rows. Here is an example using a 50x60 data.frame.

ncol <- 60;
nrow <- 50;
df <- data.frame(matrix(runif(nrow * ncol), ncol = ncol))

df %>%
    rowid_to_column("row") %>%
    gather(k, v, -row) %>%
    mutate(group = (as.numeric(sub("X", "", k)) - 1) %/% 10) %>%
    group_by(group, row) %>%
    summarise(v.mean = mean(v)) %>%
    spread(group, v.mean) %>%
    select(-row)
## A tibble: 50 x 6
#     `0`   `1`   `2`   `3`   `4`   `5`
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.372 0.514 0.400 0.565 0.489 0.412
# 2 0.344 0.465 0.625 0.421 0.602 0.519
# 3 0.393 0.389 0.465 0.607 0.504 0.539
# 4 0.545 0.599 0.530 0.552 0.661 0.568
# 5 0.589 0.456 0.590 0.557 0.441 0.494
# 6 0.588 0.602 0.362 0.524 0.526 0.644
# 7 0.432 0.624 0.457 0.539 0.530 0.481
# 8 0.494 0.519 0.661 0.568 0.709 0.610
# 9 0.397 0.413 0.398 0.370 0.720 0.570
#10 0.639 0.495 0.551 0.717 0.721 0.496
## ... with 40 more rows

Sample data

df <- read.table(text =
    "X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60
4   14.7637 14.2117 14.1237 13.6637 12.9837 13.3237 13.8877 15.0997 15.5717 16.5157 15.0597 13.5317 13.6957 13.2637 13.5117 13.4237 14.1277 13.8437 12.8357 13.6277 13.2077 14.9837 16.1277 15.6197 15.7517 16.8557 15.9757 15.9677 16.1677 17.1557 16.1157 16.3557 16.2037 16.8077 16.6757 16.4837 16.7877 16.1037 16.3117 16.0637 16.1077 16.2477 17.1917 18.1236 18.5036 18.2956 20.9516 18.0636 18.5516 19.1756 19.5996 19.2036 18.1996 16.7117 16.7037 16.7877 16.5837 17.6636 18.8596 18.3356
5   16.9597 15.9037 15.3917 15.6797 15.6797 15.8397 17.1517 18.0796 18.6236 20.4796 18.8796 16.2877 16.7997 15.6157 16.9917 16.8317 16.9917 17.5356 16.3517 15.1357 16.5437 17.4077 18.4316 17.0557 17.3117 19.1676 18.2396 16.7037 17.2157 19.1676 18.2076 16.7677 18.7196 19.4236 18.2716 17.5356 18.7196 17.8876 17.2477 16.9597 17.2797 18.3996 19.5516 19.2636 20.0956 20.4476 21.5356 18.4316 20.7356 22.1436 21.6636 20.7676 19.7436 18.5596 17.9516 17.8876 18.1116 19.2956 20.3516 19.4876", header = T)

I received an error "Error in library(tidyverse) : there is no package called ‘tidyverse’", similar to the zoo package, do I need to download this package separately? — TCB at UGA, Jun 18 '18 at 23:58
@TCBatUGA Yes, the easiest way is to `install.packages("tidyverse")` to install relevant `tidyverse` packages. Ditto for `zoo`: `install.packages("zoo")`. — Maurits Evers, Jun 19 '18 at 00:05
So when I ran this after installing the package I got: # A tibble: 52 x 1 `` 1 NA 2 NA followed by more of the NA for each row — TCB at UGA, Jun 19 '18 at 00:40
@TCBatUGA My example is fully reproducible and based on the sample data you give. **You should confirm that you can reproduce the output given the sample data.** It sounds like your actual data might be different from the sample data you give (your sample data has 60 columns, which get collapsed into a 6 column dataset; so the 52 columns that you mention in your last comment raise an alarm bell). — Maurits Evers, Jun 19 '18 at 01:33
Thanks for the effort, whenever I run your code with the sample data it works. However, when I bring in the other 50 rows (52 total) it breaks and give the NA result — TCB at UGA, Jun 19 '18 at 01:50
@TCBatUGA I cannot reproduce your issue. I've made an edit to show an example using a 50x60 `data.frame`; there must be some critical difference between your sample and actual data that you've ignored (in other words, your sample data is not representative of your actual data). Please double-check. For example, does your actual data have column names labelled `X1`, `X2`, and so on? — Maurits Evers, Jun 19 '18 at 03:32

Marat Talipov · Answer 3 · 2018-06-19T02:07:56.733

1

rollmean from zoo might be helpful here:

library(zoo)

m <- apply(df,1,rollmean,10) 
t(m[seq(nrow(m)) %% 10 ==1,])

#         X5      X15      X25      X35      X45      X55
# 4 14.41450 13.69210 15.78130 16.39090 18.12123 17.86484
# 5 16.97887 16.74208 17.72446 17.97403 19.78841 19.38200

I reused df from the Moody_Mudskipper's answer.

edited Jun 19 '18 at 02:07

answered Jun 18 '18 at 23:37

Marat Talipov

13,064
5
34
53

When I tried this I got an error: "Error in library(zoo) : there is no package called ‘zoo’". Do I have to download that package separately? – TCB at UGA Jun 18 '18 at 23:57
Yes, `install.packages('zoo')`. – Marat Talipov Jun 19 '18 at 00:16
When I tried this code now, I got a could not find function 'transpose' error – TCB at UGA Jun 19 '18 at 00:39
Should have been `t()` – Marat Talipov Jun 19 '18 at 02:08

score -1 · Accepted Answer · answered Jun 19 '18 at 18:47

I posted this question yesterday evening, but ended up finding a solution through much more searching. What I found was that I had to turn the data.frame into a matrix and then transpose that matrix to average each column across 10 rows. I then retransposed the data back into the shape I wanted it to be.

y <- apply(as.matrix(data), 2, as.numeric)
z <- t(y)
n=10
MatrixMeanD <- colMeans(matrix(z, nrow=10))   
#dont know why but rowMeans didnt work for me, while colMeans did?

x <- t(MatrixMeanD)
MatrixMean <- t(matrix(x,,52))
write.csv(MatrixMean,"file")

Thank you to everyone that gave me advice and attempted to help me fix my code!

Correction, rowMean worked, but did not give me the correct value for the mean. for example, instead of giving me 14.4145 for the first average, it gave me 15.1667. — TCB at UGA, Jun 19 '18 at 20:28

Means of blocks of n columns in a data.frame

4 Answers4

Update

Sample data