(Unfortunately, I am missing the basic vocabulary to formulate my question. So, please correct me where more precise terms would be useful.)

I use R to do very basic statistical analysis for benchmark results of virtual machines, and I often want to normalize my data based on some criterion.

Currently my problem is that I would like something like the following to work:

normalized_data <- ddply(bench, ~ Benchmark + Configuration + Approach,
                         transform,
                         Ratio = Time / Time[Approach == "appr2"])

So, what I actually want is to calculate the speed-up between corresponding pairs of measurements.

bench is a data frame with the columns Time, Benchmark, Configuration and Approach, and it contains 100 measurements for every combination of Benchmark, Configuration and Approach. There are exactly two approaches, and I want the speed-up of "appr2"/"appr1". Thus, looking at just one specific benchmark and one specific configuration, I have 100 measurements for "appr1" and 100 for "appr2" in my data frame. However, R gives me the following error for the query above:

Error in data.frame(list(Time = c(405.73, 342.616, 404.484, 328.742, 403.384,  : 
  arguments imply differing number of rows: 100, 0

Ideally, my query would produce a new data frame with the three columns SpeedUp, Benchmark and Configuration. Based on that, I would then be able to calculate means, confidence intervals and so on.

But at the moment, the basic problem is how to express such a normalization. For another data set I was able to calculate a normalized value like this: Time.norm = Time / Time[NumCores == min(NumCores)]. But it looks like that worked just by chance; at least, I do not understand the difference.

Any hints are appreciated. (Especially the right terminology to search for solutions to such problems.)

Edit: Thanks to Chase's hint, here is a minimal data set, which should be structurally identical to what I have; it exhibits the same behavior with respect to the query above.

bench <- structure(list(Time = c(399.04, 388.069, 401.072, 361.646),
           Benchmark = structure(c(1L, 1L, 1L, 1L), .Label = c("Fibonacci"), class = "factor"), 
           Configuration = structure(c(1L, 1L, 1L, 1L), .Label = c("native"), class = "factor"),
           Approach = structure(c(1L, 1L, 2L, 2L), .Label = c("appr1", "appr2"), class = "factor")),
      .Names = c("Time", "Benchmark", "Configuration", "Approach"),
      row.names = c(NA, 4L), class = "data.frame")
smarr
  • Hi smarr - take a look at this question for tips on formulating a good technical question: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. Particularly, look at adding `dput(yourData)` – Chase Aug 28 '11 at 16:32
  • Thanks! I added a data set above. – smarr Aug 28 '11 at 16:46
  • Looks like I still miss quite a number of basic concepts in R. The solution lies in the formula used: `~ Benchmark + Configuration + Approach` groups the data according to all three dimensions, and that is not what I actually need. The resulting data set really contained only "appr1" data, and there was nothing left to correlate it with. So, changing the formula to `~ Benchmark + Configuration` results in a data set that contains both "appr1" and "appr2" data. And then, it works as intended :) Thanks for listening. – smarr Aug 28 '11 at 17:27
  • Glad you got it figured out. Feel free to add your comment above as an answer and accept it so others know you found a solution. – Chase Aug 28 '11 at 17:52

2 Answers

If you try to do this within ddply in the manner I naively attempted at first, you find that you are only working within individual categories:

ddply(bench, ~ Benchmark + Configuration + Approach,
      transform,
      Ratio = Time / mean(Time[Approach == "appr2"]))
#------------
     Time Benchmark Configuration Approach     Ratio
1 399.040 Fibonacci        native    appr1       NaN
2 388.069 Fibonacci        native    appr1       NaN
3 401.072 Fibonacci        native    appr2 1.0516915
4 361.646 Fibonacci        native    appr2 0.9483085

Obviously not what was hoped for. You can calculate a mean value outside of the ddply call to serve as the normalization factor:

# Extract Time as a numeric vector; in current versions of R, mean() of a
# one-column data frame returns NA with a warning.
meanappr2 <- mean(subset(bench, Approach == "appr2")$Time)
ddply(bench, ~ Benchmark + Configuration + Approach,
      transform,
      Ratio = Time / meanappr2)
#--------------
     Time Benchmark Configuration Approach     Ratio
1 399.040 Fibonacci        native    appr1 1.0463631
2 388.069 Fibonacci        native    appr1 1.0175950
3 401.072 Fibonacci        native    appr2 1.0516915
4 361.646 Fibonacci        native    appr2 0.9483085

If, on the other hand, you didn't want a line-by-line normalisation but rather a cross-group comparison, use the "summarise" option within the *ply operations:

ddply(bench, ~ Benchmark + Configuration + Approach,
      summarise,
      Ratio = mean(Time) / meanappr2)
#-----------
  Benchmark Configuration Approach    Ratio
1 Fibonacci        native    appr1 1.031979
2 Fibonacci        native    appr2 1.000000
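
With more than one benchmark or configuration, a single global normalization factor would mix groups. A minimal sketch of a per-group variant follows, assuming the plyr package is installed and that "speed-up" means mean appr1 time divided by mean appr2 time (the question leaves the direction open; swap the two if you want the inverse). The column name SpeedUp is taken from the question; the data frame is rebuilt here only to keep the example self-contained.

```r
library(plyr)  # assumption: plyr is installed

# Minimal data set with the same structure as in the question.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Group by Benchmark and Configuration only, so that each group contains
# the measurements of both approaches, then reduce each group to one row.
# Assumed direction: mean(appr1) / mean(appr2).
speedup <- ddply(bench, ~ Benchmark + Configuration,
                 summarise,
                 SpeedUp = mean(Time[Approach == "appr1"]) /
                           mean(Time[Approach == "appr2"]))
speedup
```

For the sample data this should give one row per Benchmark/Configuration pair, with the same ratio that the appr1 row shows in the summarise output above.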
IRTFM
  • Sorry, I was not clear enough about what I intended. I found a solution to my problem, and posted it as an answer. Still, many thanks! – smarr Aug 29 '11 at 07:14

Looks like I still miss quite a number of basic concepts in R.

The solution lies in the formula used: ~ Benchmark + Configuration + Approach groups the data according to all three dimensions, and that is not what I actually need. Each resulting group contained data for just one approach (for example, only "appr1"), so there was nothing left to relate it to.
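
The following base-R sketch illustrates where the "differing number of rows" error comes from. It rebuilds the minimal data set from the question and takes one group as the three-variable formula would produce it; the names appr1_only, divisor, ratio and msg are made up for this illustration. It presumably also explains why the NumCores variant worked: min(NumCores) is evaluated within the group, so that subset is never empty.

```r
# Minimal data set with the same structure as in the question.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# One group as produced by ~ Benchmark + Configuration + Approach:
# it holds rows of a single approach only.
appr1_only <- subset(bench, Approach == "appr1")

# No row in this group matches "appr2", so the divisor is an empty vector.
divisor <- appr1_only$Time[appr1_only$Approach == "appr2"]
length(divisor)

# Arithmetic with a zero-length operand yields a zero-length result.
ratio <- appr1_only$Time / divisor
length(ratio)

# transform() cannot attach a zero-length column to a two-row data frame,
# so it raises an error; capture its message instead of stopping.
msg <- tryCatch(
  transform(appr1_only, Ratio = Time / Time[Approach == "appr2"]),
  error = function(e) conditionMessage(e)
)
msg
```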

So, changing the formula to ~ Benchmark + Configuration results in a data set that contains "appr1" and "appr2" data for all Time measurements. And then, it works as intended :)
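
For reference, here is the corrected call applied to the minimal data set from the question, as a sketch that assumes the plyr package is installed. Note that the division relies on R's vector recycling: the shorter "appr2" vector is repeated to match the group length, so the i-th row of each approach is paired with the i-th "appr2" row. That only gives meaningful pairs if both approaches have the same number of measurements per group and the rows are ordered consistently.

```r
library(plyr)  # assumption: plyr is installed

# Minimal data set with the same structure as in the question.
bench <- data.frame(
  Time          = c(399.04, 388.069, 401.072, 361.646),
  Benchmark     = factor("Fibonacci"),
  Configuration = factor("native"),
  Approach      = factor(c("appr1", "appr1", "appr2", "appr2"))
)

# Group by Benchmark and Configuration only, so each group holds the
# measurements of both approaches and the "appr2" subset is non-empty.
normalized_data <- ddply(bench, ~ Benchmark + Configuration,
                         transform,
                         Ratio = Time / Time[Approach == "appr2"])
normalized_data
```

The "appr2" rows are divided by themselves and therefore get a Ratio of 1; filter them out before computing means or confidence intervals on the "appr1" ratios.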

smarr