
I am working on extracting substrings from a dataframe having 700,000 rows.

For example, let's say I have a variable `ord` which is a dataframe with 700,000 rows and one column. I am using `substr(ord[, 1], 23, 36)` but it is very slow. Can somebody suggest another method or improvements?

Vivek Kalyanarangan
  • Hi & welcome to Stack Overflow. Because it's [common here](https://stackoverflow.com/a/28481250/6574038), would you mind adding some example data to your code? You can do this by adding the output of `dput(head(ord))` to your question. Cheers. – jay.sf Dec 30 '18 at 10:39
  • If you have a data frame with one column, I would advise turning it into a vector; that would make `substr` a little bit faster. – pogibas Dec 30 '18 at 10:47
    What is "very slow"? On my machine which is nothing special at all, it works almost instantaneously. – vaettchen Dec 30 '18 at 10:59

2 Answers


Are you sure you don't have a tibble instead of a data.frame?

For a one-variable tibble, using `[, 1]` does nothing: you get back the same one-variable tibble. Always think of a data.frame or a tibble as a list (not a matrix) and use `[[1]]` to access the first variable as a vector.
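
As a quick sketch of the distinction (using a made-up two-row tibble, `toy_tbl`, just for illustration):

library(tibble)

toy_tbl <- tibble(xcol = c("abcdefghijklmnopqrstuvwxyz", "zyxwvutsrqponmlkjihgfedcba"))

class(toy_tbl[, 1])         # still a one-column tibble: "tbl_df" "tbl" "data.frame"
class(toy_tbl[[1]])         # the underlying vector: "character"
substr(toy_tbl[[1]], 2, 4)  # works on the vector: "bcd" "yxw"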

In terms of timings (with one-tenth of the data, i.e. 70,000 strings):

ord_ch <- rep(replicate(700, paste(sample(letters, 40, TRUE), collapse = "")), 100)
ord_df <- data.frame(xcol = ord_ch, stringsAsFactors = FALSE)
ord_df_fct <- data.frame(xcol = ord_ch, stringsAsFactors = TRUE)
ord_tbl <- tibble::tibble(xcol = ord_ch)

microbenchmark::microbenchmark(
  substr(ord_ch, 23, 36),
  substr(ord_df[, 1], 23, 36),
  substr(ord_df_fct[, 1], 23, 36),
  substr(ord_tbl[, 1], 23, 36),
  times = 10
)

Benchmark result:

Unit: milliseconds
                            expr         min          lq        mean      median          uq         max neval
          substr(ord_ch, 23, 36)    8.807504    8.921520    9.253258    9.321168    9.391774   10.077075    10
     substr(ord_df[, 1], 23, 36)    8.711323    8.775754    9.030802    8.965194    9.167970    9.713614    10
 substr(ord_df_fct[, 1], 23, 36)    9.337599    9.544920   10.065594    9.595284   10.016577   12.173109    10
    substr(ord_tbl[, 1], 23, 36) 1433.387037 1446.136184 1456.639754 1453.826835 1460.824234 1494.942769    10
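
For comparison, a quick sketch extending the benchmark: extracting the tibble column with `[[1]]` (or `$xcol`) hands `substr()` a plain character vector, so the timing should drop back to roughly vector speed:

microbenchmark::microbenchmark(
  substr(ord_tbl[[1]], 23, 36),   # [[ extracts the character vector from the tibble
  substr(ord_tbl$xcol, 23, 36),   # $ does the same
  times = 10
)
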
F. Privé

Let's compare the substring operation on a data.frame column and on a character vector, using base R and stringi:

# install.packages("rbenchmark")
# install.packages("stringi")

library(rbenchmark)
library(stringi)

ord_ch <- rep(paste0(sample(c(letters,LETTERS), 40), collapse = ""), 700000)
ord_df <- data.frame(ord_ch, stringsAsFactors = FALSE)

# rbenchmark runs each expression several times (here 100 replications) and compares the elapsed time.
benchmark("dataframe" = substr(ord_df[, 1], 23, 36)
          , "character" = substr(ord_ch, 23, 36)
          , "stri_character" =  stringi::stri_sub(ord_ch, 23, 36)
          , "stri_dataframe" =  stringi::stri_sub(ord_df[, 1], 23, 36)
          , replications = 100) 

Here are the results from my system. Over 100 replications, the data.frame version is about 0.5 seconds slower than the character vector. You can decide for your case whether it is worth changing the type.

#             test replications elapsed relative user.self sys.self user.child sys.child
#       character          100    7.61    1.000      7.37     0.19         NA        NA
#       dataframe          100    8.10    1.064      7.71     0.28         NA        NA
#  stri_character          100    7.88    1.035      7.15     0.55         NA        NA
#  stri_dataframe          100    8.16    1.072      6.94     0.64         NA        NA
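
If that difference matters for your case, one option (a sketch using the `ord_df` defined above) is to extract the column to a plain character vector once and run the substring call on that:

ord_vec <- ord_df[[1]]           # pull the column out as a character vector once
out <- substr(ord_vec, 23, 36)   # subsequent substring work runs on the vector
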
Selcuk Akbas
  • That's for running the code 100 times, so doing the extraction once takes less than 0.1 sec, which I struggle to call very slow. Or am I missing something? – vaettchen Dec 30 '18 at 13:45