I'm working with existing R code that my predecessor wrote. The code is for generating a PDF report to display data from test runs of our software.
One of the sets of charts I'm trying to create is supposed to chart the percent-change from a "Benchmark" result. This benchmark is supposed to simply be the earliest version we have data for.
Here is the section of code that exists currently to build the benchmark deviation chart.
library(ggplot2)
dbhandle <- SQLConn_remote(DBName = "DATABASE", ServerName = "SERVER")
Testdf<-sqlQuery(dbhandle, 'select * from TABLENAME
order by FileName, Number, Category', stringsAsFactors = FALSE)
versions<-unique(Testdf[order(Testdf$Number), ][,2])
benchmarks<-aggregate(Value~FileName, subset(Testdf, Number == 1 | Number == 2)[, c('FileName', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'
Testdf<-merge(Testdf, benchmarks)
Testdf$Version<-factor(Testdf$Version, levels = versions)
Testdf$Deviation<-Testdf$Value- Testdf$Benchmark
Testdf$DeviationP<-(Testdf$Value- Testdf$Benchmark)/Testdf$Benchmark
g<-ggplot(subset(Testdf, !is.na(Value) & Deviation <.5) , aes(color = Value, x = Version, y = Deviation, group = FileName)) + geom_line() +geom_point(aes(shape = Build), size = 1.5) +
scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") +
ylab("Run Time Deviation from Benchmark (min)") +
scale_colour_gradient(name = 'Run Time',low = 'blue', high = 'red') +
theme(axis.text.x = element_text(angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1))
g
Currently, the method of calculating the "Benchmark" value isn't working. I'm providing an example dataframe for R below if you'd like to see what the code currently does. The part that is confusing me the most is the benchmark
variable. I honestly have hardly any idea what is going on. I've never used the aggregate()
function before so the syntax is completely foreign to me and I've had a horrible time tracking down documentation for it (that I understand). The specific part that is the most confusing is subset(Testdf, Number == 1 | Number == 2)
. Originally the code had Number == 14 | Number == 15
. If I recall, |
means "or" (and the amount of Number entries was far greater in the 30+ range).
Perhaps you can help me understand a smart way to generate this chart that I'm looking to make and help me make sense of this code.
EDIT:
I'd like to get a chart where every single entry charted is for the Run Time
category and that for each FileName, the chart begins at 0 to show deviations from the original. I'd also like code to select the earliest Number
entry not just Number == 1
because sometimes there might not be an entry for Number == 1
. This is what I've come up with so far:
versions<-unique(AutoRegdf[order(AutoRegdf$TestNum), ][,2])
benchmarks<-aggregate(Value~Test_Scenario, subset(AutoRegdf, min(AutoRegdf$TestNum) & Measure == 'Run Time')[, c('Test_Scenario', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'
AutoRegdf<-merge(AutoRegdf, benchmarks)
AutoRegdf$JMPTVersion<-factor(AutoRegdf$JMPTVersion, levels = versions)
AutoRegdf$Deviation<-AutoRegdf$Value- AutoRegdf$Benchmark
AutoRegdf$DeviationP<-(AutoRegdf$Value- AutoRegdf$Benchmark)/AutoRegdf$Benchmark
g<-ggplot(subset(AutoRegdf, Measure == 'Batch Time' & !is.na(Value) & Deviation <.5) , aes(color = Value, x = JMPTVersion, y = Deviation, group = Test_Scenario)) +
geom_line(size=.25) + geom_point(aes(shape = Build), size = 1.5) +
scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") +
ylab("Run Time Deviation from Benchmark (min)") +
scale_colour_gradient(name = 'Run Time (min)',low = 'blue', high = 'red') +
theme(axis.text.x = element_text(size = 10, angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1)) +
theme(plot.margin=unit(c(0,0,0,0),"mm"))
g
If you'd like to recreate this yourself, you can use this example dataframe in R.
rw1 <- c("File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3")
rw2 <- c("0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03")
rw3 <- c("Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final")
rw4 <- c(123, 456, 789, 312, 645, 978, 741, 852, 963, 369, 258, 147, 753, 498, 951, 753, 915, 438, 978, 741, 852, 963, 369, 258, 147, 753, 498)
rw5 <- c("01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12")
rw6 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
rw7 <- c("Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release")
rw8 <- c("None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "Cannot Connect to Database", "None", "None", "None", "None", "None", "None", "None", "None")
Testdf = data.frame(rw1, rw2, rw3, rw4, rw5, rw6, rw7, rw8)
colnames(Testdf) <- c("FileName", "Version", "Category", "Value", "Date", "Number", "Build", "Error")