TL;DR: I want to left-pad each string in a vector with a given character up to a given length, and I want it to be fast. See the code and example below.

I have a very large vector of strings containing, well, anything, but with a known maximum number of characters. I want to left-pad those strings with zeros to a given width (greater than the maximum number of characters).

Suppose I have:

c("yop",NA,"1234567","19","12AN","PLOP","5689777")

Given, for example, a target width of 10, I want:

[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"

as a result, as fast as possible.

I've tried writing my own function, but it's not really fast. Could you help me make it faster? I have billions of those to process.

Here's my current code:

library(purrr)
zero_left <- function(field,nb){
  map2_chr(
    map(abs(nb-nchar(field)),~ rep("0",.x)),
    field,
    ~ paste0(c(.x,.y),collapse=""))
}

trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)

This code does not even handle the NA case. Without NA it works, but it is too slow.

lrnv

2 Answers

This relies on an external package but takes 1/30 of the time your zero_left() function takes:

nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
[1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"

Edit 1:

A base R solution, which is probably not quite as fast:

gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
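
One thing to watch with the `sprintf()` + `gsub()` approach: it pads with spaces and then replaces every space, so spaces already inside a string are turned into zeros too, and `NA` is formatted as the text "NA" before padding. A minimal sketch of both effects, assuming plain ASCII input (the vector `x` is made up for illustration and is not from the question):

```r
nb <- 10
x <- c("a b", NA, "19")   # hypothetical input with an embedded space and an NA

# Pad with spaces, then swap every space for "0"
padded <- gsub(" ", "0", sprintf("%*s", nb, x), fixed = TRUE)
padded[1]   # "0000000a0b" : the embedded space was replaced as well
padded[3]   # "0000000019"

# NA was formatted as text, so restore it explicitly
padded[is.na(x)] <- NA_character_
```

If the strings can contain spaces, one of the other approaches on this page is probably the safer choice.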

Edit 2:

Since stringr is just a wrapper around stringi functions, you can get another speed boost by calling stringi directly:

stringi::stri_pad_left(trial, width = nb, pad = "0")

s_baldur

If speed is your concern, base R can be faster than stringr/stringi:

library(microbenchmark)
microbenchmark(
  stringr=stringr::str_pad(trial, width=nb, pad="0"),
  stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
  base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
#     expr    min     lq     mean  median      uq     max neval
#  stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470   100
#  stringi 10.473 12.359 13.15298 13.0180 13.5445  21.418   100
#     base  7.848  9.392 10.83702 10.2035 10.8980  43.620   100

The only catch is that NA is turned into the literal string "NANA" here:

paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA"       "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"

so the workaround is

microbenchmark(
      stringr=stringr::str_pad(trial, width=nb, pad="0"),
      stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
      base={v=paste(strrep("0", nb - nchar(trial)), trial, sep="");v[is.na(trial)]=NA;}
    )
# Unit: microseconds
#     expr    min      lq     mean  median      uq    max neval
#  stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096   100
#  stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135   100
#     base 10.766 11.9185 13.69714 13.0665 13.8035 87.226   100

(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)

(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)
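
For reuse, the workaround above could be wrapped in a small helper. This is only a sketch: the name `pad_left_zero` is made up here, and it assumes, as the question states, that `width` is at least as large as the longest string.

```r
# Hypothetical helper: left-pad with "0" in base R, keeping NA as NA.
# Assumes width >= max(nchar(x), na.rm = TRUE).
pad_left_zero <- function(x, width) {
  out <- paste0(strrep("0", width - nchar(x)), x)
  out[is.na(x)] <- NA_character_   # undo the "NANA" that paste0() produced
  out
}

trial <- c("yop", NA, "1234567", "19", "12AN", "PLOP", "5689777")
pad_left_zero(trial, 10)
# [1] "0000000yop" NA           "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"
```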

r2evans
  • I thought of this solution but 1. what if in another application `nb < max(nchar(trial))` and 2. The speed advantage is nonexistent with a long trial, for example `rep(trial, 10000)`. – s_baldur Jun 28 '18 at 15:25
  • Good point, but it was specifically in the OP that: "left Zero's to a given size *(superior to the maximum number of char)*" (emphasis mine). – r2evans Jun 28 '18 at 15:38
  • All that does is change it to `paste(strrep("0", pmax(0, nb - nchar(trial))), trial, sep="")`, which does not affect the time significantly. (`strrep("0",0)` yields the empty string.) – r2evans Jun 28 '18 at 15:39
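
To illustrate the `pmax()` guard from the last comment, here is a small sketch with a width shorter than one of the strings. The values of `nb` and `x` are chosen only for this example:

```r
nb <- 5
x <- c("19", "1234567")   # the second string is longer than nb

# pmax(0, ...) clamps the negative pad count to zero,
# so the long string is returned unchanged
res <- paste(strrep("0", pmax(0, nb - nchar(x))), x, sep = "")
res
# [1] "00019"   "1234567"
```

Without the guard, the pad count for the second element would be negative, which `strrep()` does not accept.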