
I want to summarize 1000 files, each with 40 rows and 20 columns. I want to produce two summary files, each preserving the original 40x20 dimensions: the first with the mean, and the second with the standard deviation, of each position in the file across all 1000 values. From the post below I found a very elegant way to compute the mean across all the files (thanks @josliber), but I am struggling to extend that logic to the standard deviation.

Average multiple csv files into 1 averaged file in r

I am at the point where I have loaded my data into a list of data frames:

# list.files() takes a regex, not a glob, so anchor and escape the dot
csvs <- lapply(list.files(pattern = "^weather.*\\.csv$"), read.csv)

And Reduce worked fine to get my mean summary file. Can we do something similar (or different) to get my standard deviations?

Reduce("+", csvs) / length(csvs)
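To make the Reduce step concrete, here is a tiny reproducible check I ran with two made-up 2x2 data frames (toy data, not my real files):

df1 <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- data.frame(a = c(5, 6), b = c(7, 8))
toy <- list(df1, df2)

# element-wise sum of the data frames, divided by their count = element-wise mean
Reduce("+", toy) / length(toy)
#   a b
# 1 3 5
# 2 4 6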
ecolog

2 Answers


Another approach opens up several other statistical options.

If you convert the list of 40x20 data.frames into a 40x20x1000 array, you can apply a function across each of the 40x20 "tubes" running along the 3rd dimension.

Using a sample of three 2x4 matrices:

set.seed(42)
lst <- lapply(1:3, function(ign) matrix(sample(8), nrow=2))
lst
# [[1]]
#      [,1] [,2] [,3] [,4]
# [1,]    8    2    3    4
# [2,]    7    5    6    1
# [[2]]
#      [,1] [,2] [,3] [,4]
# [1,]    6    3    7    8
# [2,]    5    4    1    2
# [[3]]
#      [,1] [,2] [,3] [,4]
# [1,]    8    3    4    2
# [2,]    1    6    7    5

Using the abind package, we can bind along an arbitrary dimension, here the third. (This is where you would begin, given that your data.frames are already captured in a list; abind works just as well with identically-sized data.frames as with matrices.)

library(abind)
ary <- abind(lst, along = 3)
dim(ary)
# [1] 2 4 3
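With the question's data, the same call should work directly on the list of 1000 data.frames (untested here, since I don't have your files):

ary <- abind(csvs, along = 3)  # csvs is your list of 40x20 data.frames
dim(ary)
# expected: [1] 40 20 1000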

And now we can run arbitrary functions along each "tube" (rather than along each "row" or "column", the way apply is most often used). For example, since the [1,1] values in the three layers are 8, 6, and 8, we would expect the following statistics:

mean(c(8,6,8))
# [1] 7.333333
sd(c(8,6,8))
# [1] 1.154701

Now, using apply:

apply(ary, 1:2, mean)
#          [,1]     [,2]     [,3]     [,4]
# [1,] 7.333333 2.666667 4.666667 4.666667
# [2,] 4.333333 5.000000 4.666667 2.666667
apply(ary, 1:2, sd)
#          [,1]      [,2]     [,3]     [,4]
# [1,] 1.154701 0.5773503 2.081666 3.055050
# [2,] 3.055050 1.0000000 3.214550 2.081666
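Since you ultimately want two CSV files, each 40x20 result can be written straight back out; the file names here are just placeholders:

write.csv(apply(ary, 1:2, mean), "summary_mean.csv", row.names = FALSE)
write.csv(apply(ary, 1:2, sd), "summary_sd.csv", row.names = FALSE)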

This opens up further statistical aggregation of your 1000 identically-sized data.frames, assuming the same position in each layer is meaningfully comparable. You might be able to devise a working Reduce scheme for the median or another percentile, but it's quite easy to do, say, apply(ary, 1:2, quantile, 0.9) for the 90th percentile, as shown below.
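For instance, sticking with the toy 2x4x3 array from above, the median per "tube" is just another apply call, and the result is again one value per cell:

apply(ary, 1:2, median)
#      [,1] [,2] [,3] [,4]
# [1,]    8    3    4    4
# [2,]    5    5    6    2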

r2evans

You could take a similar approach again, using the basic maths behind the standard deviation calculation:

# get the means as before
means <- Reduce("+", csvs) / length(csvs)

# make a new list of squared deviations from that known mean
sq.dev <- lapply(csvs, function(x) (x - means)^2)

# average the squared deviations and take the square root;
# dividing by length(csvs) gives the population SD, while
# length(csvs) - 1 gives the sample SD that matches R's sd()
st.dev <- sqrt(Reduce("+", sq.dev) / (length(csvs) - 1))

For details on the maths here, search Wikipedia for "Standard deviation".
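As a quick sanity check on toy data (three made-up 2x2 data.frames), this Reduce formulation with the n - 1 divisor agrees with sd() applied position-wise:

set.seed(1)
toy <- lapply(1:3, function(i) as.data.frame(matrix(rnorm(4), nrow = 2)))

means <- Reduce("+", toy) / length(toy)
sq.dev <- lapply(toy, function(x) (x - means)^2)
st.dev <- sqrt(Reduce("+", sq.dev) / (length(toy) - 1))

# the same statistic computed with sd() across the third dimension
manual <- apply(simplify2array(lapply(toy, as.matrix)), 1:2, sd)
all.equal(as.matrix(st.dev), manual, check.attributes = FALSE)
# [1] TRUE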

rosscova