Speeding up a function used to create convenient groups for mapping and charting

Question

Background

Following useful discussions, and help that I received from the SO colleagues with respect to:

I have combined a convenience function. That takes a numeric vector and generates factored vector pertaining to groups.

Function

The body of the function is given below.

nice.cuts <- function(variable, cuts = 10, thousands.separator = FALSE) {

  # Load required packages (useful when used independently of context)
  Vectorize(require)(package = c("gsubfn", "Hmisc", "scales"),
                     character.only = TRUE)

  # Destring this variable
  destring <- function(x) {
    ## convert factor to strings
    if (is.character(x)) {
      as.numeric(x)
    } else if (is.factor(x)) {
      as.numeric(levels(x))[x]
    } else if (is.numeric(x)) {
      x
    } else {
      stop("could not convert to numeric")
    }
  }

  # Apply function
  variable <- destring(variable)

  # Check whether to disable scientific notation
  if (mean(variable) > 100000) {
    options(scipen = 999)
  } else {
    options(scipen = 0)
  }

  # Create pretty breaks
  cut_breaks <- pretty_breaks(n = cuts)(variable)

  # Round it two decimal places
  variable <- round(variable, digits = 2)

  # Develop cuts according to the provided object
  cuts_variable <- cut2(x = variable, cuts = cut_breaks)

  # Check if variable is total or with decimals
  if (all(cut_breaks %% 1 == 0)) {
    # Variable is integer
    clean_cuts <- gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+',
                         ~paste0(x, '-',as.numeric(y)-1),
                         as.character(cuts_variable))
  } else {
    # Variable is not integer
    # Create clean cuts
    clean_cuts <- gsubfn('\\[\\s*([0-9]+\\.*[0-9]*),\\s*(\\d+\\.\\d+).*',
                         ~paste0(x, '-', as.numeric(y)- 0.01),
                         as.character(cuts_variable))
  }

  # Clean Inf
  clean_cuts <- gsub("Inf", max(variable), clean_cuts)

  # Clean punctuation
  clean_cuts <- sub("\\[(.*), (.*)\\]", "\\1 - \\2", clean_cuts)

  # Replace strings with spaces
  clean_cuts <- gsub("-"," - ",clean_cuts, fixed = TRUE)

  # Trim white spaces
  clean_cuts <- trimws(clean_cuts)

  # Order factor before returning
  clean_cuts <- factor(clean_cuts, levels = unique(clean_cuts[order(variable)]))

  if (thousands.separator == TRUE) {
    levels(clean_cuts) <- sapply(strsplit(levels(clean_cuts), " - "),
                                 function(x) paste(prettyNum(x,
                                                             big.mark = ",",
                                                             preserve.width = "none"),
                                                   collapse = " - "))
  }

  # Return
  return(clean_cuts)
}

Results

The function is extremely useful when generating factors used for mapping. For example for the following values:

set.seed(1)
dta <- data.frame(values=floor(runif(100, 10000,90000)))

The function will generate pretty breaks

> dta$cuts <- nice.cuts(dta$values, thousands.separator = TRUE)
> t(t(table(dta$cuts))) #' t() for presentation

                  [,1]
  10,000 - 19,999    9
  20,000 - 29,999   11
  30,000 - 39,999   12
  40,000 - 49,999   20
  50,000 - 59,999    6
  60,000 - 69,999   15
  70,000 - 79,999   17
  80,000 - 89,999   10

Which can be used to generate amazing legends:

This is extremely useful when generating data for choropleth maps and I use it all the time.

Problem

The challenge pertains to poor performance. The function seems to be very slow.

Really small data set

Performance is not amazing for a small data set of 100 observations:

> require(microbenchmark)
> dta <- data.frame(values=floor(runif(100, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE))
Unit: milliseconds
                                              expr      min       lq     mean   median       uq      max neval
 nice.cuts(dta$values, thousands.separator = TRUE) 32.67988 58.25709 99.26317 95.25195 136.7998 222.2178   100

Small data sets

and becomes really slow for even slightly bigger data sets:

> dta <- data.frame(values=floor(runif(1000, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE),
+                times = 10)
Unit: milliseconds
                                              expr      min       lq     mean   median       uq      max neval
 nice.cuts(dta$values, thousands.separator = TRUE) 428.6821 901.2123 1154.097 1068.845 1679.052 1708.836    10

My question is hence fairly simple, I want to keep the current functionalities of the nice.cuts function but I want to make it run faster.

Suggestions

I reckon that the gsubfn element takes quite a lot of time but I haven't figured out the way how could I make it more efficient.
I'm also thinking that taking unique values of the variable may be speed up the things a little. In my real data I often work with vector where certain values are repeated

This seems to be more fitting for the [Code Review](http://codereview.stackexchange.com/) sister site. If you want to have it here, you should boil it down to the code problem, i.e., something more minimal. — Roland, Jan 15 '16 at 13:36
@Roland Thanks for the suggestion, I will be happy to delete and repost at the *CR*. Maybe I will wait a few minutes to see if there is interest amongst the members of the SO community. I was misled by the availability of the [tag:performance] tag as this is where my issue is. — Konrad, Jan 15 '16 at 13:48
@NicE thanks very much for your suggestion. As a matter of fact I'm interested in building an ordered factor vector. In effect, I often use the function on values already available in `data.frames` after `ggfortify` from shape files so I want to have is a factor column for all the observations that has bracket values instead of the core values. The basic idea is to have a neat set of a few colours to apply to map/chart legend. — Konrad, Jan 15 '16 at 13:50

Jan van der Laan · Accepted Answer · 2016-01-15T14:30:41.773

You do all of the cleanup of the labels on the complete input vector: you first generate a character vector after cut2 after which you do a large number of regular expressions on this vector. However, you are only modifying the labels.

Therefore, after generating cut_breaks, I would first generate the labels in the correct format: cut_labels. I have done this in a new version of cut.labels below. Benchmarking against the original shows a huge improvement:

> require(microbenchmark)
> dta <- data.frame(values=floor(runif(1000, 10000,90000)))
> microbenchmark(nice.cuts(dta$values, thousands.separator = TRUE),
+   nice.cuts2(dta$values, thousands.separator = TRUE))
Unit: milliseconds
                                               expr      min        lq     mean    median        uq        max neval cld
  nice.cuts(dta$values, thousands.separator = TRUE) 720.1378 815.51782 902.9218 923.97881 968.39036 1208.00434   100   b
 nice.cuts2(dta$values, thousands.separator = TRUE)  11.4147  15.18232  16.6196  16.46937  17.05305   29.91089   100  a 
>

New version of nice.cuts

I took the labels of cuts_variable and applied all of the steps of the original function to these labels. I then overwrite the labels of cuts_variable with these new labels.

nice.cuts2 <- function(variable, cuts = 10, thousands.separator = FALSE) {

  # Load required packages (useful when used independently of context)
  Vectorize(require)(package = c("gsubfn", "Hmisc", "scales"),
                     character.only = TRUE)

  # Destring this variable
  destring <- function(x) {
    ## convert factor to strings
    if (is.character(x)) {
      as.numeric(x)
    } else if (is.factor(x)) {
      as.numeric(levels(x))[x]
    } else if (is.numeric(x)) {
      x
    } else {
      stop("could not convert to numeric")
    }
  }

  # Apply function
  variable <- destring(variable)

  # Check whether to disable scientific notation
  if (mean(variable) > 100000) {
    options(scipen = 999)
  } else {
    options(scipen = 0)
  }

  # Create pretty breaks
  cut_breaks <- pretty_breaks(n = cuts)(variable)

  # Round it two decimal places
  variable <- round(variable, digits = 2)

  # Develop cuts according to the provided object
  cuts_variable <- cut2(x = variable, cuts = cut_breaks)

  cuts_labels <- levels(cuts_variable)

  # Check if variable is total or with decimals
  if (all(cut_breaks %% 1 == 0)) {
    # Variable is integer
    cuts_labels <- gsubfn('\\[\\s*(\\d+),\\s*(\\d+)[^0-9]+',
                         ~paste0(x, '-',as.numeric(y)-1),
                         as.character(cuts_labels))
  } else {
    # Variable is not integer
    # Create clean cuts
    cuts_labels <- gsubfn('\\[\\s*([0-9]+\\.*[0-9]*),\\s*(\\d+\\.\\d+).*',
                         ~paste0(x, '-', as.numeric(y)- 0.01),
                         as.character(cuts_labels))
  }

  # Clean Inf
  cuts_labels <- gsub("Inf", max(variable), cuts_labels)

  # Clean punctuation
  cuts_labels <- sub("\\[(.*), (.*)\\]", "\\1 - \\2", cuts_labels)

  # Replace strings with spaces
  cuts_labels <- gsub("-"," - ",cuts_labels, fixed = TRUE)

  # Trim white spaces
  cuts_labels <- trimws(cuts_labels)


  if (thousands.separator == TRUE) {
    cuts_labels <- sapply(strsplit(cuts_labels, " - "),
                                 function(x) paste(prettyNum(x,
                                                             big.mark = ",",
                                                             preserve.width = "none"),
                                                   collapse = " - "))
  }

  levels(cuts_variable) <- cuts_labels
  cuts_variable
}

Thanks very for showing the interest and the valuable contribution. Would you care to post the benchmark results to show the gain on speed? — Konrad, Jan 15 '16 at 13:53
@Konrad Added a new version of `nice.cuts` and benchmarked it against the original. — Jan van der Laan, Jan 15 '16 at 14:31
Thanks very much for the comments, that's a really significant gain on performance! — Konrad, Jan 15 '16 at 14:37