(Unfortunately, I am missing the basic vocabulary to formulate my question, so please correct me where more precise terms would be useful.)
I use R to do very basic statistical analysis for benchmark results of virtual machines, and I often want to normalize my data based on some criterion.
Currently my problem is that I would like something like the following to work:
normalized_data <- ddply(bench, ~ Benchmark + Configuration + Approach,
                         transform,
                         Ratio = Time / Time[Approach == "appr2"])
So, what I actually want is to calculate the speed-up between corresponding pairs of measurements.
bench is a data frame with the columns Time, Benchmark, Configuration and Approach, and contains 100 measurements for every possible combination of Benchmark, Configuration and Approach.
I have exactly two approaches and want the speed-up of "appr2"/"appr1".
Thus, looking at just one specific benchmark and one specific configuration, I have 100 measurements for "appr1" and 100 for "appr2" in my data frame. However, R gives me the following error for the query above:
Error in data.frame(list(Time = c(405.73, 342.616, 404.484, 328.742, 403.384, :
arguments imply differing number of rows: 100, 0
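The 0 in that message seems to be the length of the subset Time[Approach == "appr2"] inside a piece that only holds "appr1" rows. Here is a small base-R sketch of that suspicion, using split() as a stand-in for the grouping that ddply performs. The data frame toy and its values are made up for illustration and are not my real measurements.

```r
# Hypothetical four-row stand-in for `bench` (values are made up).
toy <- data.frame(
  Time     = c(400, 390, 380, 370),
  Approach = c("appr1", "appr1", "appr2", "appr2")
)

# Grouping by Approach puts each approach into its own piece.
pieces <- split(toy, toy$Approach)

# Inside the "appr1" piece no row satisfies Approach == "appr2",
# so the denominator of the ratio is a zero-length vector.
denominator <- with(pieces[["appr1"]], Time[Approach == "appr2"])
length(denominator)
```

If that reading is right, the ratio cannot be recycled against the rows of the piece, which would match the "100, 0" in the error.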
Ideally, my query would produce a new data frame with the three columns SpeedUp, Benchmark and Configuration. Based on that, I would then be able to calculate means, confidence intervals and so on.
But at the moment, the basic problem is how to express such a normalization. For another data set I was able to calculate a normalized value like this: Time.norm = Time / Time[NumCores == min(NumCores)]
but it looks like that worked just by chance; at least, I do not understand the difference.
Any hints are appreciated. (Especially the right terminology to search for solutions to such problems.)
Edit: Thanks to Chase's hint, here is a minimal data set, which should be structurally identical to what I have, and which exhibits the same behavior with respect to the query above.
bench <- structure(list(Time = c(399.04, 388.069, 401.072, 361.646),
Benchmark = structure(c(1L, 1L, 1L, 1L), .Label = c("Fibonacci"), class = "factor"),
Configuration = structure(c(1L, 1L, 1L, 1L), .Label = c("native"), class = "factor"),
Approach = structure(c(1L, 1L, 2L, 2L), .Label = c("appr1", "appr2"), class = "factor")),
.Names = c("Time", "Benchmark", "Configuration", "Approach"),
row.names = c(NA, 4L), class = "data.frame")
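To make the desired result concrete, here is a rough base-R sketch of the output I am after for this minimal data set. It is only an illustration, not necessarily the right way to do it: it assumes that the i-th "appr1" row corresponds to the i-th "appr2" row within each Benchmark/Configuration combination, and it rebuilds bench with data.frame() so that the block runs on its own. The names pieces and speedups are just helpers for this sketch.

```r
# Minimal data set from above, rebuilt so the block is self-contained.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor(rep("Fibonacci", 4)),
  Configuration = factor(rep("native", 4)),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Split by Benchmark and Configuration only, so that every piece
# still contains the rows of both approaches.
pieces <- split(bench, list(bench$Benchmark, bench$Configuration), drop = TRUE)

# Assumption: measurements pair up by row order within a piece.
speedups <- lapply(pieces, function(d) {
  data.frame(
    Benchmark     = d$Benchmark[1],
    Configuration = d$Configuration[1],
    SpeedUp       = d$Time[d$Approach == "appr2"] / d$Time[d$Approach == "appr1"]
  )
})

# One row per pair of measurements, with the three desired columns.
normalized_data <- do.call(rbind, speedups)
rownames(normalized_data) <- NULL
```

With the real data this would give 100 SpeedUp values per benchmark and configuration, on which means and confidence intervals could then be computed.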