
I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.

I found a previous Stack Overflow post that answers a similar question: Plot the run time of three functions

It seems that the "microbenchmark" library in R should be able to accomplish this task.

Suppose I simulate the following data:

#load libraries

library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )


#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)

#declare var_3 and var_4 as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

#add id
f$ID <- seq_along(f[,1])

Now, I want to measure the run time of 7 different procedures:

#Procedure 1

gower_dist <- daisy(f[,-5],
                    metric = "gower")

gower_mat <- as.matrix(gower_dist)


#Procedure 2

lof <- lof(gower_dist, k=3)

#Procedure 3

lof <- lof(gower_dist, k=5)

#Procedure 4

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 5

tsne_obj <- Rtsne(gower_dist, perplexity =10,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 6

plot = ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point(aes())

#Procedure 7

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(
    name = f$ID, 
    lof=lof,
    var1=f$var_1,
    var2=f$var_2,
    var3=f$var_3
    )

p1 <- ggplot(aes(x = X, y = Y, size=lof, key=name, var1=var1, 
  var2=var2, var3=var3), data = tsne_data) + 
  geom_point(shape=1, col="red")+
  theme_minimal()

ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))

Using the "microbenchmark" library, I can find out the time of individual functions:

procedure_1_part_1 <- microbenchmark(daisy(f[,-5],
                    metric = "gower"))

procedure_1_part_2 <-  microbenchmark(as.matrix(gower_dist))

I want to make a graph of the run times like this:

https://umap-learn.readthedocs.io/en/latest/benchmarking.html

Question: Can someone please show me how to make this graph and how to use microbenchmark() on multiple functions at once, for different sizes of the data frame "f" (the first 5, 10, 50, 100, 200, 500 and 1000 rows)?

microbenchmark({gower_dist <- daisy(f[1:5,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

microbenchmark({gower_dist <- daisy(f[1:10,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

microbenchmark({gower_dist <- daisy(f[1:50,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

etc

There does not seem to be a straightforward way to do this in R:

mean(procedure_1_part_1)
[1] NA

Warning message:
In mean.default(procedure_1_part_1) :
  argument is not numeric or logical: returning NA

I could manually run each one of these, copy the results into Excel and plot them, but this would also take a long time.

 tm <- microbenchmark( daisy(f[,-5],
                        metric = "gower"),
    as.matrix(gower_dist))

 tm
Unit: microseconds
                             expr    min     lq     mean  median      uq    max neval cld
 daisy(f[, -5], metric = "gower") 2071.9 2491.4 3144.921 3563.65 3621.00 4727.8   100   b
            as.matrix(gower_dist)  129.3  147.5  194.709  180.80  232.45  414.2   100  a 

Is there a quicker way to make a graph?

Thanks

stats_noob

2 Answers

Here is a solution that benchmarks the first three procedures from the original post and then charts their average run times with ggplot().

Setup

We start the process by executing the code necessary to create the data from the original post.

library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
library(microbenchmark)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )

#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4,ID=1:1000)

#declare var_3 and var_4 as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

Automation of the benchmarking process by data frame size

First, we create a vector of data frame sizes to drive the benchmarking.

# configure run sizes
sizes <- c(5,10,50,100,200,500,1000)

Next, we take the first procedure and alter it so we can vary the number of observations that are used from the data frame f. Note that since we need to use the outputs from this procedure in subsequent steps, we use assign() to write them to the global environment. We also include the number of observations in the object name so we can retrieve them by size in subsequent steps.

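# Procedure 1
proc1 <- function(size){
    # distance object for the first `size` rows, stored as gower_dist_<size>
    assign(paste0("gower_dist_",size),
           daisy(f[1:size,-5], metric = "gower"),
           envir = .GlobalEnv)

    # matrix form of the same distances, stored as gower_mat_<size>
    assign(paste0("gower_mat_",size),
           as.matrix(get(paste0("gower_dist_",size),envir = .GlobalEnv)),
           envir = .GlobalEnv)
}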
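To make the assign()/get() naming scheme concrete, here is a minimal base R sketch. The helper `store_dist()`, the `demo_dist_` prefix and the toy data are hypothetical stand-ins, and `dist()` is used in place of `daisy()` only so the example needs no extra packages.

```r
# Hypothetical helper: stores a distance object under a size-specific name
# in the global environment, mirroring the naming pattern used by proc1().
store_dist <- function(size, data) {
  obj_name <- paste0("demo_dist_", size)
  assign(obj_name, dist(data[seq_len(size), ]), envir = .GlobalEnv)
  invisible(obj_name)
}

# Toy numeric data (an assumption, not the data frame from the question).
demo_data <- data.frame(a = rnorm(20), b = rnorm(20))
for (s in c(5, 10)) store_dist(s, demo_data)

# Later steps rebuild the same name to fetch the object for a given size.
d5  <- get("demo_dist_5", envir = .GlobalEnv)
d10 <- get(paste0("demo_dist_", 10), envir = .GlobalEnv)

length(d5)   # pairwise distances among 5 rows: 5 * 4 / 2
length(d10)  # pairwise distances among 10 rows: 10 * 9 / 2
```

A named list indexed by size would avoid writing to the global environment, at the cost of passing that list into each later procedure.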

To run the benchmark by data frame size we use the sizes vector with lapply() and an anonymous function that executes proc1() repeatedly. We also assign the number of observations to a column called obs so we can use it in the plot.

proc1List <- lapply(sizes,function(x){
        b <- microbenchmark(proc1(x))
        b$obs <- x
        b
})
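It may help to look at what a single benchmark object contains before combining them. The sketch below uses a stand-in expression (`sort(x)`) rather than the procedures from the post; the point is only that the object is a data frame whose `time` column holds one raw timing per evaluation, in nanoseconds.

```r
library(microbenchmark)

# Stand-in workload (an assumption, chosen only because it runs quickly).
x <- runif(1000)
b <- microbenchmark(sort(x), times = 20L)

class(b)            # a "microbenchmark" object that is also a data frame
names(b)            # the expression label and the raw timings
nrow(b)             # one row per evaluation, so 20 here

mean(b$time)        # average run time in nanoseconds
mean(b$time) / 1e6  # the same average expressed in milliseconds

# summary() computes min, quartiles, mean and max in the requested unit.
summary(b, unit = "ms")
```

Because the object is an ordinary data frame, adding a column such as `obs` works the same way as for any other data frame.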

At this point we have one data frame per benchmark based on size. We combine the benchmarks into a single data frame with do.call() and rbind().

proc1summary <- do.call(rbind,(proc1List))

Next, we use the same process with procedures 2 and 3. Notice how we use get() with paste0() to retrieve the correct gower_dist objects by size.

#Procedure 2

proc2 <- function(size){
        lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=3)
}
proc2List <- lapply(sizes,function(x){
    b <- microbenchmark(proc2(x))
    b$obs <- x
    b
})
proc2summary <- do.call(rbind,(proc2List))

#Procedure 3

proc3 <- function(size){
    lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=5)
}

Since k must be less than the number of observations, we adjust the sizes vector to start at 10 for procedure 3.

# configure run sizes
sizes <- c(10,50,100,200,500,1000)

proc3List <- lapply(sizes,function(x){
    b <- microbenchmark(proc3(x))
    b$obs <- x
    b
})
proc3summary <- do.call(rbind,(proc3List))
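Rather than retyping the sizes vector, the same restriction could be derived from k. This is a small sketch with hypothetical names (`all_sizes`, `valid_sizes`); it assumes the only constraint is that the number of observations exceeds k.

```r
# Hypothetical guard: keep only the sizes that have more rows than neighbours.
k <- 5
all_sizes <- c(5, 10, 50, 100, 200, 500, 1000)

valid_sizes <- all_sizes[all_sizes > k]
valid_sizes
```

Deriving the vector this way also leaves the original sizes untouched for the procedures that can use all of them.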

Having generated runtime benchmarks for each of the first three procedures, we bind the summary data, summarize to means with dplyr::summarise(), and plot with ggplot().

do.call(rbind,list(proc1summary,proc2summary,proc3summary)) %>% 
    group_by(expr,obs) %>%
    summarise(.,time_ms = mean(time) * .000001) -> proc_time 

The resulting data frame has all the information we need to produce the chart: the procedure used, the number of observations in the original data frame, and the average time in milliseconds.

> head(proc_time)
# A tibble: 6 x 3
# Groups:   expr [1]
  expr       obs time_ms
  <fct>    <dbl>   <dbl>
1 proc1(x)     5   0.612
2 proc1(x)    10   0.957
3 proc1(x)    50   1.32 
4 proc1(x)   100   2.53 
5 proc1(x)   200   5.78 
6 proc1(x)   500  25.9 
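The factor of .000001 converts the raw timings, which microbenchmark records in nanoseconds, into milliseconds. As a quick check of that arithmetic, here is the same aggregation done with base R `aggregate()` instead of dplyr, on a made-up data frame (`toy`) whose values are chosen so the means are easy to verify by hand.

```r
# Made-up raw timings in nanoseconds; not measured results.
toy <- data.frame(
  expr = c("proc1(x)", "proc1(x)", "proc1(x)", "proc1(x)"),
  obs  = c(5, 5, 10, 10),
  time = c(1e6, 3e6, 4e6, 6e6)
)

# Mean time per procedure and size, still in nanoseconds.
toy_time <- aggregate(time ~ expr + obs, data = toy, FUN = mean)

# 1 millisecond = 1e6 nanoseconds.
toy_time$time_ms <- toy_time$time / 1e6
toy_time
```

For size 5 the mean of 1e6 and 3e6 nanoseconds is 2e6, which is 2 ms; for size 10 it is 5 ms.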

Finally, we use ggplot() to produce an x y chart, grouping the lines by procedure used.

ggplot(proc_time,aes(obs,time_ms,group = expr)) +
    geom_line(aes(group = expr),color = "grey80") + 
    geom_point(aes(color = expr))

...and the output:

[Chart: mean run time in milliseconds against number of observations, one line per procedure]

Since procedures 2 and 3 vary only slightly, k = 3 vs. k = 5, they are almost indistinguishable in the chart.
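One possible way to keep near-identical lines readable is to give each procedure its own panel. The sketch below is an assumption about presentation, not part of the original chart, and the timings in `toy_time` are invented purely so the example is self-contained.

```r
library(ggplot2)

# Invented timings for illustration only; not measured results.
toy_time <- data.frame(
  expr    = rep(c("proc1(x)", "proc2(x)", "proc3(x)"), each = 3),
  obs     = rep(c(10, 100, 1000), times = 3),
  time_ms = c(1, 3, 90, 0.20, 1.5, 40, 0.21, 1.6, 42)
)

# One panel per procedure, so overlapping lines no longer hide each other.
p <- ggplot(toy_time, aes(obs, time_ms)) +
  geom_line(color = "grey80") +
  geom_point(aes(color = expr)) +
  facet_wrap(~ expr) +
  labs(x = "Number of observations",
       y = "Mean run time (ms)",
       color = "Procedure")
p
```

Faceting trades direct comparison on one axis for legibility; adding `scales = "free_y"` to `facet_wrap()` would further let each panel use its own y range.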

Conclusions

With a combination of wrapper functions and lapply() we can generate the information needed to produce the chart requested in the original post.

The general pattern of modifications is:

  1. Wrap the original procedure in a function that we can use as the unit of analysis for microbenchmark(), and include a size argument
  2. Modify the procedure to use size as a variable where necessary
  3. Modify the procedure to access objects from previous steps, based on the size argument
  4. Modify the procedure to write its outputs with assign() and size if these are needed for subsequent procedure steps
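The pattern above can be folded into one reusable function. The following is a sketch under stated assumptions: `bench_by_size()` is a hypothetical helper, `toy_proc()` is a stand-in procedure built on base R `dist()`, and `times` is lowered from the microbenchmark default of 100 only to keep the run short.

```r
library(microbenchmark)

# Hypothetical helper (not from the original answer): runs `fun(size)` under
# microbenchmark() for each size and returns one combined data frame.
bench_by_size <- function(fun, sizes, label, times = 10L) {
  runs <- lapply(sizes, function(s) {
    b <- as.data.frame(microbenchmark(fun(s), times = times))
    b$expr <- label   # readable procedure name instead of "fun(s)"
    b$obs  <- s       # number of observations used in this run
    b
  })
  do.call(rbind, runs)
}

# Stand-in procedure: distance matrix on the first `size` rows of toy data.
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy_proc <- function(size) as.matrix(dist(toy[seq_len(size), ]))

toy_times <- bench_by_size(toy_proc, sizes = c(10, 50, 200), label = "toy_proc")

# Mean run time in milliseconds per size (raw times are in nanoseconds).
aggregate(time ~ obs, data = toy_times, FUN = function(t) mean(t) / 1e6)
```

Calling the helper once per procedure and binding the results would give a single data frame in the same shape as the one plotted above.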

We leave automation of benchmarking procedures 4 - 7 by data frame size and integrating them into the plot as an interesting exercise for the original poster.

Len Greski
  • I can not thank you enough for the answer you have provided. I spent the whole day learning about functions in R and debugging code. I think I am getting closer, but I am still running into some errors. I posted my attempts at benchmarking the remaining procedures over here: https://stackoverflow.com/questions/65461979/r-errors-encountered-during-loops-x-input-name-cant-be-recycled-to-size-1 but I keep getting a "Recycling error". If you have time, could you please take a look at it? Thank you again for all your help. – stats_noob Dec 27 '20 at 00:29
  • @stats555 - check your other question for an updated version of the code that runs all 7 benchmarks and generates a graph. The recycling error is due to the fact that the vectors being assigned in `mutate()` need to be subset based on the `size` parameter. – Len Greski Dec 27 '20 at 03:39

My first answer severely misunderstood your question. I hope this can be of some help.

library(tidyverse)
library(broom)
library(microbenchmark)

# Benchmark your expressions. The following script assumes you name the benchmarks as function_n, but this can (and should be) improved on.
res = microbenchmark(
  rnorm_100 = rnorm(100),
  runif_100 = runif(100),
  rnorm_1000 = rnorm(1000),  # the rnorm_1000 row printed below was recorded with runif(1000) here
  runif_1000 = runif(1000)
)

# We will be using this gist to tidy the frame
# Source: https://gist.github.com/nutterb/e9e6da4525bacac99899168b5d2f07be
tidy.microbenchmark <- function(x, unit, ...){
  summary(x, unit = unit)
}

# Tidy the frame; convert = TRUE turns n into a number so it plots on a numeric axis
res_tidy = tidy(res) %>% 
  mutate(expr = as.character(expr)) %>% 
  separate(expr, c("func","n"), remove = FALSE, convert = TRUE)

res_tidy
#>         expr  func    n    min      lq     mean  median      uq     max neval
#> 1  rnorm_100 rnorm  100  8.112  9.3420 10.58302 10.2915 10.9755  44.903   100
#> 2  runif_100 runif  100  4.487  5.1180  6.12284  6.1990  6.5925  10.907   100
#> 3 rnorm_1000 rnorm 1000 34.631 36.3155 37.78117 37.2665 38.4510  62.951   100
#> 4 runif_1000 runif 1000 34.668 36.6330 39.48718 37.7995 39.2905 105.325   100

# Plot the runtime for the different expressions by sample number
ggplot(res_tidy, aes(x = n, y = mean, group = func, col = func)) +
  geom_line() +
  geom_point() +
  labs(y = "Runtime", x = "n")

Created on 2020-12-26 by the reprex package (v0.3.0)
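One way the function_n naming convention might be avoided is to loop over the sizes and attach n as a column directly, so nothing has to be parsed back out of the expression names. This is only a sketch: `sizes`, `res_list` and `res_all` are hypothetical names, and `times = 20L` is an arbitrary choice to keep the run short.

```r
library(microbenchmark)

sizes <- c(100, 1000)

# One summary per size, each tagged with the size it was run at.
res_list <- lapply(sizes, function(n) {
  s <- summary(
    microbenchmark(rnorm = rnorm(n), runif = runif(n), times = 20L),
    unit = "us"
  )
  s$n <- n
  s
})

# A single data frame with one row per function and size.
res_all <- do.call(rbind, res_list)
res_all[, c("expr", "n", "mean", "median")]
```

The result has `expr` as the function label and `n` as a numeric column, which is the shape the ggplot() call above expects once `func` is replaced by `expr`.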

mhovd
  • thank you! this is a great start! I will try to see if I can adapt this code to plot run times for procedures 1-7 (for different sizes of object "f"), and then connect those lines – stats_noob Dec 26 '20 at 16:32
  • Also take a look at how to use this with `tidy` in order to easily grab the results into a data frame. There is an example in the vignette I linked, but if you want me to write something up let me know! – mhovd Dec 26 '20 at 16:33
  • thanks again! the code you provided makes a plot between variables mpg and wt ... I will try to see if I can plot the "run time" instead – stats_noob Dec 26 '20 at 16:36
  • To clarify, the code I provided was from the vignette example, showcasing the benchmark for the plot between `mpg` and `wt` from cars as an example. In your case, you will have to do that for each of the instances you want to benchmark, then plot that data to get your desired output. Let me know if I can help you. – mhovd Dec 26 '20 at 16:45
  • I found another useful stackoverflow post: https://stackoverflow.com/questions/41523644/how-can-i-plot-benchmark-output – stats_noob Dec 26 '20 at 17:20
  • I initially misunderstood your question, I am working on an edit to my answer now. – mhovd Dec 26 '20 at 17:45
  • @stats555, I updated my answer to better answer your question (I hope). My apologies for misunderstanding your question to begin with! – mhovd Dec 26 '20 at 18:00