1

I'm working with existing R code that my predecessor wrote. The code is for generating a PDF report to display data from test runs of our software.

One of the sets of charts I'm trying to create is supposed to chart the percent-change from a "Benchmark" result. This benchmark is supposed to simply be the earliest version we have data for.

Here is the section of code that exists currently to build the benchmark deviation chart.

library(ggplot2)

dbhandle <- SQLConn_remote(DBName = "DATABASE", ServerName = "SERVER")
Testdf<-sqlQuery(dbhandle, 'select * from TABLENAME 
                order by FileName, Number, Category', stringsAsFactors = FALSE)
versions<-unique(Testdf[order(Testdf$Number), ][,2])

benchmarks<-aggregate(Value~FileName, subset(Testdf, Number == 1 | Number == 2)[, c('FileName', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'

Testdf<-merge(Testdf, benchmarks)
Testdf$Version<-factor(Testdf$Version, levels = versions)
Testdf$Deviation<-Testdf$Value- Testdf$Benchmark
Testdf$DeviationP<-(Testdf$Value- Testdf$Benchmark)/Testdf$Benchmark

g<-ggplot(subset(Testdf, !is.na(Value) & Deviation <.5) , aes(color = Value, x = Version, y = Deviation, group = FileName)) + geom_line() +geom_point(aes(shape = Build), size = 1.5) +
  scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") + 
  ylab("Run Time Deviation from Benchmark (min)") +  
  scale_colour_gradient(name = 'Run Time',low = 'blue', high = 'red') + 
  theme(axis.text.x = element_text(angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1))
g

Currently, the method of calculating the "Benchmark" value isn't working. I'm providing an example dataframe for R below if you'd like to see what the code currently does. The part that is confusing me the most is the benchmark variable. I honestly have hardly any idea what is going on. I've never used the aggregate() function before so the syntax is completely foreign to me and I've had a horrible time tracking down documentation for it (that I understand). The specific part that is the most confusing is subset(Testdf, Number == 1 | Number == 2). Originally the code had Number == 14 | Number == 15. If I recall, | means "or" (and the amount of Number entries was far greater in the 30+ range).

Perhaps you can help me understand a smart way to generate this chart that I'm looking to make and help me make sense of this code.

EDIT:

I'd like to get a chart where every single entry charted is for the Run Time category and that for each FileName, the chart begins at 0 to show deviations from the original. I'd also like code to select the earliest Number entry not just Number == 1 because sometimes there might not be an entry for Number == 1. This is what I've come up with so far:

versions<-unique(AutoRegdf[order(AutoRegdf$TestNum), ][,2])

benchmarks<-aggregate(Value~Test_Scenario, subset(AutoRegdf, min(AutoRegdf$TestNum) & Measure == 'Run Time')[, c('Test_Scenario', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'

AutoRegdf<-merge(AutoRegdf, benchmarks)
AutoRegdf$JMPTVersion<-factor(AutoRegdf$JMPTVersion, levels = versions)
AutoRegdf$Deviation<-AutoRegdf$Value- AutoRegdf$Benchmark
AutoRegdf$DeviationP<-(AutoRegdf$Value- AutoRegdf$Benchmark)/AutoRegdf$Benchmark

g<-ggplot(subset(AutoRegdf, Measure == 'Batch Time' & !is.na(Value) & Deviation <.5) , aes(color = Value, x = JMPTVersion, y = Deviation, group = Test_Scenario)) + 
  geom_line(size=.25) + geom_point(aes(shape = Build), size = 1.5) +
  scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") + 
  ylab("Run Time Deviation from Benchmark (min)") +  
  scale_colour_gradient(name = 'Run Time (min)',low = 'blue', high = 'red') + 
  theme(axis.text.x = element_text(size = 10, angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1)) + 
  theme(plot.margin=unit(c(0,0,0,0),"mm"))
g

If you'd like to recreate this yourself, you can use this example dataframe in R.

rw1 <- c("File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3")
rw2 <- c("0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03")
rw3 <- c("Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final")
rw4 <- c(123, 456, 789, 312, 645, 978, 741, 852, 963, 369, 258, 147, 753, 498, 951, 753, 915, 438, 978, 741, 852, 963, 369, 258, 147, 753, 498)
rw5 <- c("01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12")
rw6 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
rw7 <- c("Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release")
rw8 <- c("None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "Cannot Connect to Database", "None", "None", "None", "None", "None", "None", "None", "None")


Testdf = data.frame(rw1, rw2, rw3, rw4, rw5, rw6, rw7, rw8)
colnames(Testdf) <- c("FileName", "Version", "Category", "Value", "Date", "Number", "Build", "Error") 
Community
  • 1
  • 1
JohnN
  • 968
  • 4
  • 13
  • 35
  • 1
    You are right, `|` means OR. `aggregate` applies a function to a slice of data.frame. The slice is defined on the right side of the tilde `~`. `Number == 1 | Number == 2` can also be written as `Number %in% c(1, 2)`. – Roman Luštrik Jul 06 '15 at 20:28

1 Answers1

2

I'm assuming your problem is specifically with calculating the benchmark variable.

First, it appears that the intention was to calculate the mean of Value for all rows where number == 1 or number == 2, by file.

This is accomplished in two steps.

  1. subset(Testdf, Number == 1 | Number == 2)[, c('FileName', 'Value')] which returns the rows where number is 1 or 2, and the columns FileName and Value.
  2. aggregate(Value~FileName,subset(*as above*), mean) which takes the mean of Value, by Filename. Since we filtered, it only considers rows that meet the number criteria.

The line written results in:

>benchmarks 
  FileName Benchmark
1    File1 357.0
2    File2 689.5
3    File3 777.0

They then merge it back to the frame on file name. More explicit code here would be:

Testdf<-merge(Testdf, benchmarks, by = "FileName")

This results in a data frame that looks like:

 FileName Version Category Value     Date Number     Build Error Benchmark
1    File1    0.01     Time   123 01/01/12      1 Iteration  None       357
2    File1    0.01     Size   456 01/01/12      1 Iteration  None       357
3    File1    0.01    Final   789 01/01/12      1 Iteration  None       357
4    File1    0.02    Final   147 01/01/12      2 Iteration  None       357
5    File1    0.03    Final   852 01/01/12      3   Release  None       357
6    File1    0.02     Time   369 01/01/12      2 Iteration  None       357

Each row then has the mean of Value for that filename.

They then calculate the deviation from this benchmark, both as a % and a #.

Alternate Way

The data.table syntax may be easier to understand:

library(data.table)
setDT(Testdf)
Testdf[, Benchmark := mean(Value[Number == 1 | Number == 2]), by = "FileName"]

Breaking this down:

Testdf[, Because there is nothing to the left of the comma, we are applying this to every row

Benchmark := mean(Value[Number == 1 | Number == 2]) This creates a new column called benchmark. The value of benchmark is the mean of the column Value, but only for rows where number is 1 or 2

, by = "FileName"] We will calculate benchmark seperately for each filename. One way to think about this is we will take all rows where filename == File1, and then take the mean of Value. Then take all rows where filename == File2 and do the same thing. The by= argument does this for every unique value of FileName.

Next Steps

The question is: What should the code do? Is taking the mean the right benchmark? If so, the code above works. The graph looks to be a mess, so there might be an issue with your ggplot code. Clarifying this more will help us help you.

Chris
  • 6,302
  • 1
  • 27
  • 54
  • Thanks for the awesome answer. I noticed that the graph is a mess. I thought that `Number ==1 | Number ==2` was selecting two values to use as a benchmark. This is not the correct way I want to do this. I want to use the earliest entry as a benchmark. This way the chart will plot the deviation from the original value. The catch is that I don't want to just select `Number == 1` because sometimes there are errors and crashes that lead to no Value at all for an entry. If there is no entry in `Number == 1` then the code should move to `Number == 2` – JohnN Jul 07 '15 at 14:38
  • @David how would "earliest" be represented in the data? – Chris Jul 07 '15 at 15:37
  • It would be the lowest `Number` found in the column for that particular `FileName`. In other words, the code should set the benchmark according to whatever the minimum number is in the `Number` column. – JohnN Jul 07 '15 at 16:59
  • Also, how could I make it so that the only Category being graphed was the `Time` category? – JohnN Jul 07 '15 at 17:52
  • @David Look at the example you provided, you need to come up with a method to deal with conflicts. I.e. there are three number 1s in the first few rows, all for file1. Which one should be used? For the graph question, I would suggest starting a new question, as it is more likely to be answered than if nested in an existing unrelated question – Chris Jul 07 '15 at 19:04
  • That's fine. I tried this `subset(Testdf, (Number == 1 | Number == 2) & Category== 'Time')`. I am however interested in figuring out how to come up with a way to set this for Number = (earliest or minimum value). That was part of the original question. – JohnN Jul 07 '15 at 19:28
  • Is there a way that I could create a variable `minValue` that is set up something like this: `minValue<-min(Testdf[,6])` and use that instead of `Number == 1 | Number == 2` – JohnN Jul 08 '15 at 15:42
  • @David does `Testdf[, Benchmark := Value[which.min(Number)], by = "FileName"]` do what you need? What this code is doing is finding out which value for `Number` is the min for each file, and then taking the `Value` in that row. Again, since you have multiple Values for File1,Number1, it will take the first instance for the Benchmark. If that is not the required behaviour, you should find a way to deal with conflicts – Chris Jul 08 '15 at 19:20
  • I can't find a package called `data.table` – JohnN Jul 08 '15 at 19:33
  • @David you will need to install it from CRAN. use: `install.packages("data.table")` – Chris Jul 08 '15 at 20:26
  • yeah I did that and its not finding any data.table package – JohnN Jul 09 '15 at 14:01
  • Okay, for some reason yesterday it wasn't finding that package but I got it installed today. Its not working however. Its better than it was but there are still several lines on the graph which do not start at zero. Also, there is now a table that looks like the data frame being displayed in the PDF now. I don't want that there. – JohnN Jul 09 '15 at 14:48
  • @David I am more than happy to help, but this seems to have progressed beyond the scope of the initial question, and answering here will not help others with similar problems (part of what makes SO great). It seems like you now have two new problems: your graph not displaying correctly, and a data frame being inadvertently displayed in a Markdown (?) instance. I would try and formulate these into new questions, or search if they have been asked before. It will also be easier for us to help as these issues are difficult to solve with only the code given above. – Chris Jul 09 '15 at 18:06
  • @David as well, in response, please do not edit the question above (as it breaks the Q/A as it exists now. If you link the new one here I'm happy to take a look at it elsewhere – Chris Jul 09 '15 at 18:07
  • http://stackoverflow.com/questions/31321573/chart-for-benchmark-data-isnt-calculated-properly – JohnN Jul 09 '15 at 18:41
  • http://stackoverflow.com/questions/31323517/selecting-a-unique-value-from-an-r-data-frame – JohnN Jul 09 '15 at 18:41