R string operation: how could I optimize this one?

Question

TL;DR: I want to left-pad each string in a vector to a given width with a given character, and I want it to be fast. See the code and example below.

I have a very large vector of strings containing, well, anything, but with a known maximum number of characters. I want to left-pad those strings with zeros to a given width (greater than the maximum number of characters).

Suppose:

c("yop",NA,"1234567","19","12AN","PLOP","5689777")

Given, for example, a target width of 10, I want:

[1] "0000000yop" NA "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"

as a result, as fast as possible.

I've tried to write my own, but it's not really fast. Could you help me make it faster? I have billions of those to process.

Here's my current code:

library(purrr)
zero_left <- function(field,nb){
  map2_chr(
    map(abs(nb-nchar(field)),~ rep("0",.x)),
    field,
    ~ paste0(c(.x,.y),collapse=""))
}

trial <- c("yop","1234567","19","12AN","PLOP","5689777")
zero_left(trial,10)

This code does not even handle the NA case. Without NAs it works, but it is too slow.

(It might be nice if you retroactively accept some of [your previous-questions' answers](https://stackoverflow.com/users/8425270/pilou-lnrv), too. Thanks!) — r2evans, Jul 03 '18 at 21:10

s_baldur · Answer 1 · 2018-06-28T15:27:38.430

5

This relies on an external package but takes 1/30 of the time your zero_left() function takes:

nb <- 10
stringr::str_pad(trial, width=nb, pad="0")
# [1] "0000000yop" "0001234567" "0000000019" "00000012AN" "000000PLOP" "0005689777"

Edit 1:

A base-R solution that probably isn't as fast:

gsub(pattern = " ", replacement = "0", sprintf("%*s", nb, trial), fixed = TRUE)
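One caveat of the sprintf()/gsub() route is worth spelling out. Because gsub() replaces every space, any spaces that are part of the original strings are turned into zeros too. A minimal sketch, where the helper name pad_sprintf is hypothetical and not part of the answer:

```r
# Hypothetical helper wrapping the sprintf()/gsub() approach shown above.
pad_sprintf <- function(x, nb) {
  # "%*s" right-aligns each string in a field of width nb, padding with spaces
  padded <- sprintf("%*s", as.integer(nb), x)
  # every space becomes "0", including spaces that were already inside x
  gsub(pattern = " ", replacement = "0", padded, fixed = TRUE)
}

pad_sprintf(c("yop", "19"), 5)  # strings without spaces are padded as intended
pad_sprintf("a b", 5)           # the inner space is replaced as well
```

If the data can contain spaces, one of the dedicated padding functions is probably the safer choice.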

Edit 2:

Since stringr is largely a wrapper around stringi functions, you can get a further speed boost by calling stringi directly:

stringi::stri_pad_left(trial, width = nb, pad = "0")

edited Jun 28 '18 at 15:27

answered Jun 28 '18 at 14:03

s_baldur


Just perfect... How didn't I see this one first? Thank you very much :) – lrnv Jun 28 '18 at 14:09

score 2 · Accepted Answer · answered Jun 28 '18 at 15:09

If speed is your concern, base R can be faster than stringr/stringi:

library(microbenchmark)
microbenchmark(
  stringr=stringr::str_pad(trial, width=nb, pad="0"),
  stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
  base=paste(strrep("0", nb - nchar(trial)), trial, sep="")
)
# Unit: microseconds
#     expr    min     lq     mean  median      uq     max neval
#  stringr 21.292 22.747 24.87188 23.7070 24.4735 129.470   100
#  stringi 10.473 12.359 13.15298 13.0180 13.5445  21.418   100
#     base  7.848  9.392 10.83702 10.2035 10.8980  43.620   100

The one drawback is that an NA element comes out as the literal string "NANA" (here trial includes the NA from the question):

paste(strrep("0", nb - nchar(trial)), trial, sep="")
# [1] "0000000yop" "NANA"       "0001234567" "0000000019" "00000012AN"
# [6] "000000PLOP" "0005689777"

so the workaround is to reset those elements to NA afterwards:

microbenchmark(
  stringr=stringr::str_pad(trial, width=nb, pad="0"),
  stringi=stringi::stri_pad_left(trial, width = nb, pad = "0"),
  base={v <- paste(strrep("0", nb - nchar(trial)), trial, sep=""); v[is.na(trial)] <- NA}
)
# Unit: microseconds
#     expr    min      lq     mean  median      uq    max neval
#  stringr 20.657 22.6440 23.99204 23.3870 24.6190 60.096   100
#  stringi 10.980 12.1585 13.57061 13.0790 13.7800 64.135   100
#     base 10.766 11.9185 13.69714 13.0665 13.8035 87.226   100

(Which makes base R about as fast as stringi and slightly faster than stringr, in this case.)
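Outside a benchmark, the workaround reads more easily as a small function. The following is only a sketch: the name pad_left_base and its pad argument are hypothetical, and it assumes, as in the question, that nb is at least the length of the longest string.

```r
# Hypothetical wrapper around the base-R workaround above.
# Assumes nb >= max(nchar(x), na.rm = TRUE), as stated in the question.
pad_left_base <- function(x, nb, pad = "0") {
  # build the padding and glue it in front of each string;
  # NA inputs come out of paste() as the literal string "NANA"
  out <- paste(strrep(pad, nb - nchar(x)), x, sep = "")
  # restore real missing values where the input was NA
  out[is.na(x)] <- NA_character_
  out
}

trial <- c("yop", NA, "1234567", "19", "12AN", "PLOP", "5689777")
pad_left_base(trial, 10)
```

Assigning NA_character_ rather than a bare NA just makes it explicit that the result stays a character vector.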

(I'm mildly annoyed that paste converts NA to "NA", though that's already been addressed here on SO.)

I thought of this solution, but 1. what if, in another application, `nb < max(nchar(trial))`, and 2. the speed advantage is nonexistent with a long trial, for example `rep(trial, 10000)`. — s_baldur, Jun 28 '18 at 15:25
Good point, but it was specifically in the OP that: "left Zero's to a given size *(superior to the maximum number of char)*" (emphasis mine). — r2evans, Jun 28 '18 at 15:38
All that does is change it to `paste(strrep("0", pmax(0, nb - nchar(trial))), trial, sep="")`, which does not affect the time significantly. (`strrep("0",0)` yields the empty string.) — r2evans, Jun 28 '18 at 15:39
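
To illustrate the point discussed in these comments: strrep() rejects a negative repeat count, so the unguarded expression fails as soon as one string is longer than nb, while the pmax() variant leaves such strings unchanged. A small sketch with made-up values for nb and x:

```r
nb <- 5
x <- c("yop", "1234567")  # the second element is longer than nb

# Without the guard, nb - nchar(x) contains a negative value and strrep() errors;
# the error is caught here and replaced with a marker string.
unguarded <- tryCatch(
  paste(strrep("0", nb - nchar(x)), x, sep = ""),
  error = function(e) "error"
)

# With pmax(), the repeat count is clamped at 0, so over-long strings pass through as-is.
guarded <- paste(strrep("0", pmax(0, nb - nchar(x))), x, sep = "")

unguarded
guarded
```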
