
I would like to use R for plotting performance evaluation results of distinct DB systems. For each system I loaded the same data and executed the same queries in several iterations.

The data for a single system looks like this:

"iteration", "lines", "loadTime", "query1", "query2", "query3"
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2

2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5

3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2

What I need now (for plotting) is a matrix or data frame containing the average of these measurements.

At the moment I am doing this:

# read the file
all_results <- read.csv(file="file.csv", header=TRUE, sep=",")

# split the results by iteration
results <- split(all_results, all_results$iteration)

# extract each iteration's result as a data frame
r1 <- results[[1]]
r2 <- results[[2]]
r3 <- results[[3]]

# calculate the average
(r1 + r2 + r3) / 3

I could put all this into a function and calculate the average matrix in a for loop, but I have a vague feeling that there must be a more elegant solution. Any ideas?

What can I do in cases where I have incomplete results, e.g., when one iteration has fewer rows than the others?

Thanks!

behas

5 Answers


If I understand you correctly, on a given DB system, in each "iteration" (1...N) you load a sequence of data sets (1, 2, 3) and run queries on them. It seems that at the end you want the average time across all iterations for each data set. If so, you need an additional column DataSet in your all_results table that identifies the data set. We can add this column as follows:

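# label each row with the data set it belongs to;
# rep(1:3, 3) assumes 3 data sets per iteration and 3 iterations, in this row order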
all_results <- cbind( data.frame( DataSet = rep(1:3,3) ), all_results )
> all_results
  DataSet iteration  lines loadTime query1 query2 query3
1       1         1 100000    120.4    0.5    6.4    1.2
2       2         1 100000    110.1    0.1    5.2    2.1
3       3         1  50000    130.3    0.2    4.3    2.2
4       1         2 100000    120.4    0.1    2.4    1.2
5       2         2 100000    300.2    0.2    4.5    1.4
6       3         2  50000    235.3    0.4    4.2    0.5
7       1         3 100000    233.5    0.7    8.3    6.7
8       2         3 100000    300.1    0.9    0.5    4.4
9       3         3  50000    100.2    0.4    9.2    1.2

Now you can use the ddply function from the plyr package to easily extract the averages for the load and query times for each DataSet.

> library(plyr)
> ddply(all_results, .(DataSet), colwise(mean, .(loadTime, query1, query2)))
  DataSet loadTime    query1 query2
1       1 158.1000 0.4333333    5.7
2       2 236.8000 0.4000000    3.4
3       3 155.2667 0.3333333    5.9

Incidentally, I highly recommend you look at Hadley Wickham's plyr package for a rich set of data-manipulation functions.
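(plyr has since been retired in favour of its successor, dplyr. A minimal sketch of the same aggregation with dplyr, assuming the DataSet column added above:

library(dplyr)
all_results %>%
  group_by(DataSet) %>%
  summarise(across(c(loadTime, query1, query2), mean))
)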

Prasad Chalasani
  • that's indeed more elegant than for loops...thanks for the hint. – behas Jan 19 '11 at 22:26
  • what if I want to include the "lines" column in the output frame of ddply? I could do that by simply calculating the mean over the lines as well, but that somehow doesn't make sense, since the line counts are static values. – behas Jan 20 '11 at 08:58
  • I see what you mean, but I suppose there's no harm in taking the "mean" of a bunch of identical values! – Prasad Chalasani Jan 21 '11 at 14:21

I don't see why you need to split all_results by iteration. You can just use aggregate on all_results. There's no need for all iterations to have the same number of observations.

Lines <- "iteration, lines, loadTime, query1, query2, query3
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2"

all_results <- read.csv(textConnection(Lines))

aggregate(all_results[,-1], by=all_results[,"iteration",drop=FALSE], mean)
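With the sample data above, this should produce something like (values rounded):

  iteration    lines loadTime    query1 query2   query3
1         1 83333.33 120.2667 0.2666667    5.3 1.833333
2         2 83333.33 218.6333 0.2333333    3.7 1.033333
3         3 83333.33 211.2667 0.6666667    6.0 4.100000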
Joshua Ulrich
  • I think the OP needs to average by DataSet as I said in my answer, not by "iteration" (it doesn't make sense to take the average of a dataset that has 100000 lines and one with 50000 lines). At least that's my interpretation of what he really wants to do. – Prasad Chalasani Jan 19 '11 at 18:28
  • ... but the idea of using `aggregate` is nice though (+1). I was just trying to use this as an exercise to practice using `plyr`. – Prasad Chalasani Jan 19 '11 at 18:32

Did you have something like this in mind?

do.call("rbind", lapply(results, mean))
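(Note that calling mean() on a whole data frame has been deprecated and removed in modern versions of R, so this line now returns NAs with warnings. A sketch of the same idea using colMeans, assuming results is the split list from the question:

do.call("rbind", lapply(results, colMeans))
)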
Roman Luštrik

Try this:

> Reduce("+", results) / length(results)
  DataSet iteration lines loadTime    query1 query2   query3
1       1         2 1e+05 158.1000 0.4333333    5.7 3.033333
2       2         2 1e+05 236.8000 0.4000000    3.4 2.633333
3       3         2 5e+04 155.2667 0.3333333    5.9 1.300000

An aggregate solution which also works for the unbalanced case follows. Assume that the ith row of any iteration is for data set i and that we simply average within data sets. Using aggregate is straightforward; the only tricky part is getting the assignment of rows to data sets correct so that it works in the unbalanced case too. That is done by the list(data.set = ...) expression: match(it, it) gives the index of the first row of each iteration, so subtracting it from the row number and adding 1 yields each row's position within its iteration.

> it <- all_results$iteration
> aggregate(all_results, list(data.set = seq_along(it) - match(it, it) + 1), mean)
  data.set iteration lines loadTime    query1 query2   query3
1        1         2 1e+05 158.1000 0.4333333    5.7 3.033333
2        2         2 1e+05 236.8000 0.4000000    3.4 2.633333
3        3         2 5e+04 155.2667 0.3333333    5.9 1.300000
G. Grothendieck

Try, e.g.,

with(all_results, tapply(lines, iteration, mean))
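That averages only the lines column per iteration. To average every measurement column at once, one option along the same lines (a sketch) is:

sapply(all_results[-1], function(x) tapply(x, all_results$iteration, mean))

which returns a matrix with one row per iteration and one column per measurement.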
Richie Cotton