Parsing text in a single column

Question

I'm trying to take the term in the variable column and parse the "psi#." off, leaving the rest. These variables will change over time.

I tried:

df <- psi2 <-  as.data.frame(piecewise_seg2$psi) %>%
 rownames_to_column( var = "variable") %>%
 separate(variable, c("psi*"))

However, that just leaves "psi". I don't know regex, but I did try:

str_split_fixed(psi2$variable, "psi*", "[abc]+$", 2)

That didn't work either.

I did try to find something like this but mostly found parsing one character vector into a list. Any help?

I think you want to *remove* it, not *parse* it. (Parse has a very specific meaning in programming.) How about `str_replace(yourdata$variable, pattern = "psi..", replacement = "")`. In regex, `.` matches any single character, so that will match `psi` and the next two characters (which look to be a number and a dot, in your example). — Gregor Thomas, Jan 26 '18 at 15:15
When asking a question, it is always good to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). That makes it a lot easier for others to help you. — Jaap, Jan 26 '18 at 15:17
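To make the point about `.` in the comment above concrete, here is a minimal base R sketch. The vector `x` is made-up illustration data, not the asker's. It suggests that `psi..` removes exactly five characters, so it only behaves as intended when the number has a single digit, whereas `psi\\d+\\.` copes with any number of digits.

```r
# Hypothetical sample values (not taken from the question's data)
x <- c("psi1.rba_bucket", "psi12.rba_bucket")

# "." matches any single character, so "psi.." removes exactly 5 characters
fixed_width <- sub("psi..", "", x)

# "\\d+" matches one or more digits and "\\." matches a literal dot
any_digits <- sub("psi\\d+\\.", "", x)

print(fixed_width)  # the second element keeps a leading "."
print(any_digits)   # both elements are reduced to "rba_bucket"
```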

score 3 · Accepted Answer · edited Jan 26 '18 at 16:10

If you just want to remove the `psi1.` prefix, where the number can vary, you can use `str_replace`:

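library(dplyr)    # mutate() and %>%
library(stringr)  # str_replace()

df <- data.frame(var = c("psi2.1", "psi1.2", "psi33.55", "psi12.42"))
# Remove "psi", one or more digits, and the literal dot that follows
df %>% mutate(var = str_replace(var, "psi(\\d+)\\.", ""))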
#   var
# 1   1
# 2   2
# 3  55
# 4  42

Solution by @Jaap:

gsub('psi\\d+\\.', '', psi2$variable)

Here is a benchmark. I have added `stringi::stri_replace_first_regex` (labeled `stri_replace_all` below) and `gsub` with `perl = TRUE`:

microbenchmark::microbenchmark(
  str_replace = str_replace(df$var, "psi\\d+\\.", ""),
  stri_replace_all = stringi::stri_replace_first_regex(df$var, "psi\\d+\\.", ""),
  sub = sub(".*\\.", "", df$var),
  gsub = gsub('psi\\d+\\.', '', df$var),
  gsub_perl = gsub('psi\\d+\\.', '', df$var, perl = TRUE),
  times = 10000
)

Unit: microseconds
             expr    min      lq      mean  median      uq       max neval
      str_replace 96.661 106.101 129.08727 110.632 117.805  3951.009 10000
 stri_replace_all 28.319  33.228  41.57426  36.626  39.647  1980.413 10000
              sub 14.349  17.369  22.21423  19.257  23.033  1682.124 10000
             gsub 18.879  22.278  34.89121  24.921  28.697 63495.163 10000
        gsub_perl 76.272  79.293  88.32751  81.558  84.956  1865.251 10000

With only four rows, the `sub` solution is the fastest.
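One caveat when reading this comparison: the `sub` entry uses a different pattern (`.*\\.`) from the other entries, and the two are not equivalent on every input. The following sketch uses made-up values (the second one is hypothetical and contains an extra dot) to show where they diverge.

```r
# Hypothetical values; the second one has a dot inside the remaining name
x <- c("psi1.rba_bucket", "psi2.credit.tier")

# ".*" is greedy, so ".*\\." removes everything up to the LAST dot
up_to_last_dot <- sub(".*\\.", "", x)

# Anchored prefix pattern: removes only "psi", the digits, and the first dot
prefix_only <- sub("^psi\\d+\\.", "", x)

print(up_to_last_dot)
print(prefix_only)
```

If the remaining names never contain a dot, both patterns give the same result and the shorter one is fine.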

Applying the benchmark on a larger dataset:

df <- df[sample(nrow(df), 1e6, replace = TRUE), , drop = FALSE]

microbenchmark::microbenchmark(
  str_replace = str_replace(df$var, "psi\\d+\\.", ""),
  stri_replace_all = stringi::stri_replace_first_regex(df$var, "psi\\d+\\.", ""),
  sub = sub(".*\\.", "", df$var),
  gsub = gsub('psi\\d+\\.', '', df$var),
  gsub_perl = gsub('psi\\d+\\.', '', df$var, perl = TRUE),
  times = 50
)

The result (on one million rows, `gsub` with `perl = TRUE` is the fastest and plain `gsub` the slowest):

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval  cld
      str_replace 293.2773 301.9520 311.9032 308.5192 322.4974 344.7649    50  b  
 stri_replace_all 294.8729 298.8479 316.9213 306.4369 317.3555 518.5287    50  b  
              sub 468.2134 473.1803 487.0336 485.1354 498.1503 527.2476    50   c 
             gsub 649.6209 673.4312 690.7942 683.6022 701.3134 909.2599    50    d
        gsub_perl 251.0663 255.1404 263.9778 260.3426 274.6684 287.3492    50 a

`sub` will be faster than `gsub` since it is not greedy. That I know for sure. But it is unclear why `sub` is faster than `str_replace`, given that the essence of the package is to make base functions faster – Onyambu, Jan 26 '18 at 15:40
@Onyambu The essence of stringr is _not_ to make base functions faster. It's to provide a consistent interface – hadley, Jan 26 '18 at 22:04
@Hadley Yes, for sure, though I believe you did optimize it for speed, so in my opinion it should be faster than base R. When a function takes ages to run in base R, we usually turn to the various packages for alternatives – Onyambu, Jan 26 '18 at 22:12
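As a side note on the `sub` versus `gsub` comparison in these comments: the functional difference is that `sub` replaces only the first match in each string, while `gsub` replaces every match. Greediness is a separate property of quantifiers such as `*`. A minimal sketch with a made-up string containing the prefix twice:

```r
# Hypothetical input with two "psi#." prefixes in a row
x <- "psi1.psi2.rba_bucket"

first_only <- sub("psi\\d+\\.", "", x)   # replaces the first match only
all_matches <- gsub("psi\\d+\\.", "", x) # replaces every match

print(first_only)
print(all_matches)
```

For strings with a single prefix, as in the question, both functions return the same result.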

Onyambu · Answer 2 · 2018-01-26T15:29:37.163

score 3

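If you want to remove the `psi#.` prefix, you can use the `sub` function in base R:

j <- c("psi1.rba_bucket", "psi2.rba_bucket", "psi1.credit_tier_bucket")

# Remove everything up to and including the last dot
sub(".*\\.", "", j)
# [1] "rba_bucket"         "rba_bucket"        
# [3] "credit_tier_bucket"

# Remove "psi" plus the next two characters (assumes a single-digit number)
sub("psi..", "", j)
# [1] "rba_bucket"         "rba_bucket"        
# [3] "credit_tier_bucket"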

edited Jan 26 '18 at 15:29

answered Jan 26 '18 at 15:17


Hello @Onyambu. What if I wanted to parse psi, the number, and the rest into separate columns (`psi`, `#`, `variable`), giving rows such as `psi`, `1`, `rba_bucket` and `psi`, `2`, `rba_bucket`? – Jordan Jan 26 '18 at 18:24
Use `strcapture("([a-z]+)(\\d+)\\.(.*)", j, data.frame(A = character(), B = numeric(), C = character()))` – Onyambu Jan 26 '18 at 18:31
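The `strcapture` suggestion can be sketched as a complete example as follows. This reuses the vector `j` from the answer; the column names `psi`, `number`, and `variable` and the integer type for the number are illustrative choices, not something the commenters specified. `strcapture` is in the `utils` package (R 3.4.0 or later) and converts each capture group to the type of the matching prototype column.

```r
j <- c("psi1.rba_bucket", "psi2.rba_bucket", "psi1.credit_tier_bucket")

# Prototype data frame: one column per capture group, with the desired types
# (column names here are assumptions chosen for illustration)
proto <- data.frame(psi = character(), number = integer(),
                    variable = character(), stringsAsFactors = FALSE)

# Group 1: letters, group 2: digits, then a literal dot, group 3: the rest
parts <- strcapture("^([a-z]+)(\\d+)\\.(.*)$", j, proto)

print(parts)
```

Strings that do not match the pattern produce a row of `NA` values rather than an error.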
