I have to split a string by the delimiter "-", and take out the part on the far right.

SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")

However, in the code below, [ ,3] does not work for all cases, because some values contain only one "-". In the example below, the last value "YYL-S" returns an empty string.

size <- str_split(SKU, "-", simplify = T)[ ,3]

I also tried the following to index from the end, but got an error message. I also tried [ , -1], but a negative index in R drops that column rather than counting backwards.

size <- str_split(SKU, "-", simplify = T)(rev[ ,3])

3 Answers

Vectorised string operations are faster than creating and destroying objects in memory (see the benchmarks below).

Solutions which create lists of vectors that you do not need tend to be relatively slow. You can use regular expressions here to replace everything up to and including the final -.

sub(pattern = "^.+-", replacement = "", SKU)
# [1] "L"  "XL" "XS" "S" 

The caret (^) is a regex metacharacter which matches the beginning of the string. The dot (.) matches any character except a new line. The + means "match the preceding character one or more times". The .+ combination is greedy, meaning it will find the longest possible match. Taken together, the pattern matches from the beginning of the string up to and including the final -.
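
To see why greediness matters here, the sketch below compares the greedy pattern with a lazy variant. The lazy pattern "^.+?-" is not part of the answer above; it is included only for contrast, and perl = TRUE is an assumption made so that the lazy quantifier is handled by the PCRE engine.

```r
SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")

# Greedy: .+ consumes as much as possible, so the match ends at the final "-"
greedy <- sub("^.+-", "", SKU)

# Lazy (for contrast only): .+? consumes as little as possible,
# so the match ends at the first "-"
lazy <- sub("^.+?-", "", SKU, perl = TRUE)

greedy
# [1] "L"  "XL" "XS" "S"
lazy
# [1] "UA-L"  "JI-XL" "WO-XS" "S"
```

Only the greedy version strips everything before the last delimiter, which is what the question asks for.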

The sub() function replaces the first occurrence of the pattern in x (which in this case is SKU) with the replacement (which in this case is a blank string).
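
As a rough illustration of "first occurrence", the sketch below contrasts sub() with gsub() on a bare "-" pattern, and then shows what the anchored pattern does with a value that has no delimiter. The value "YYLS" is made up for the example and is not in the original data.

```r
SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")

# sub() replaces only the first match in each string
first_only <- sub("-", "", SKU)
# gsub() replaces every match in each string
all_matches <- gsub("-", "", SKU)

first_only
# [1] "PPMUA-L"  "RVKJI-XL" "KMNWO-XS" "YYLS"
all_matches
# [1] "PPMUAL"  "RVKJIXL" "KMNWOXS" "YYLS"

# A value without any "-" does not match "^.+-" and is returned unchanged
no_delim <- sub("^.+-", "", c("YYL-S", "YYLS"))
no_delim
# [1] "S"    "YYLS"
```

Because the pattern in the answer is anchored with ^, it can match at most once per string, so sub() is sufficient. If values without a delimiter should be treated as missing, they would need to be detected separately, for example with grepl("-", x, fixed = TRUE).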

You can read more here about the syntax used in regular expressions.

Benchmarking

I benchmarked seven approaches:

  1. Base R sub().
  2. Base R strsplit() |> sapply().
  3. Base R strsplit() |> vapply().
  4. stringr::str_split_i().
  5. stringr::str_split() |> vapply(\(x) tail(x, 1), character(1)).
  6. Base R lookbehind: regmatches(gregexpr()).
  7. stringr::str_extract() lookbehind.

I repeated the vector from 10 to 1e5 times. sub() is consistently the fastest approach with the least garbage collection (gc), i.e. fewest memory allocations.

There is not much difference between base::strsplit() and stringr::str_split(). sapply() does not appear to perform differently from vapply(). stringr::str_split_i() is faster than the other approaches which split the vector, and has less garbage collection, but is not as fast as sub().

stringr::str_extract() with a lookbehind is almost as fast as sub(). Using the same pattern in base R with regmatches(gregexpr()) is much slower (presumably because it returns a list).

[Figure: benchmark results for each approach, one panel per rep_num]

Code to generate the plot

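library(stringr)

SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")

# Assumed values: the text only says the vector was repeated "from 10 to 1e5 times"
rep_nums <- 10^(1:5)

results <- bench::press(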
    rep_num = rep_nums,
    {
        x <- rep(SKU, rep_num)
        bench::mark(
            min_iterations = 10,
            sub = {
                sub("^.+-", "", x)
            },
            strsplit_base_sapply = {
                strsplit(x, "-") |>
                    sapply(tail, 1)
            },
            strsplit_base_vapply = {
                strsplit(x, "-") |>
                    vapply(\(x) tail(x, 1), character(1))
            },
            str_split_i = {
                str_split_i(x, "-", -1)
            },
            str_split_vapply = {
                str_split(x, "-") |>
                    vapply(\(x) tail(x, 1), character(1))
            },
            base_r_lookbehind = {
                regmatches(
                    x,
                    gregexpr("(?<=-)[^-]+$", x, perl = TRUE)
                ) |> unlist()
            },
            stringr_lookbehind = {
                str_extract(x, "(?<=-)[^-]+$")
            }
        )
    }
)


library(ggplot2)
autoplot(results) +
    theme_bw() +
    facet_wrap(vars(rep_num), scales = "free_x")

SamR

You can use str_split_i with i = -1 to get the last part:

library(stringr) # requires stringr >= 1.5.0
str_split_i(SKU, "-", -1)
# [1] "L"  "XL" "XS" "S" 
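
If updating stringr is not an option, something similar can be sketched in base R. This is not how str_split_i() is implemented; last_piece() is a hypothetical helper name introduced only for this example.

```r
SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")

# last_piece() is a hypothetical helper, not a stringr or base R function
last_piece <- function(x, delim = "-") {
  # fixed = TRUE treats the delimiter as a literal string, not a regex
  parts <- strsplit(x, delim, fixed = TRUE)
  # Take the final piece of each split; guard against a zero-length
  # split result (e.g. for an empty string) by returning ""
  vapply(
    parts,
    function(p) if (length(p) > 0) p[length(p)] else "",
    character(1)
  )
}

size <- last_piece(SKU)
size
# [1] "L"  "XL" "XS" "S"
```

One caveat: strsplit() drops a trailing empty piece, so a value that ends in the delimiter (such as "ABC-") would return "ABC" here rather than an empty string.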
Maël
  • I am not familiar with this function. What package is it from? – SamR Feb 16 '23 at 15:53
  • It's stringr, latest version – Maël Feb 16 '23 at 15:55
  • I can't run this function, str_split_i. How do I get the latest version of STRINGR library? Thanks! – SilverSpringbb Feb 16 '23 at 16:04
  • You can update the packages, (google it) or use `install.packages("stringr")` again – Maël Feb 16 '23 at 16:07
  • @Maël Thanks. I tried that by installing stringr again, but it ended up messing up my other libraries :-( Now I can't get my dplyr or ggplot2 to work, and still str_split_i does not work either. How do I fix these issues? I apologize for this. – SilverSpringbb Feb 16 '23 at 16:20
  • It usually involves updating other packages as well. I'd suggest updating everything, especially since `dplyr` has released several useful functions in the last few weeks – Maël Feb 16 '23 at 17:13
  • @Maël thanks - clearly didn't have the latest version. Just tried it out - seems faster than `str_split()` but slower than `sub()`. – SamR Feb 16 '23 at 17:37
  • @Maël By "update everything", does that mean I have to re-install all other packages? Thanks. – SilverSpringbb Feb 16 '23 at 17:41
  • @Maël I just re-installed stringr, but that function str_split_i is still not working. – SilverSpringbb Feb 16 '23 at 17:43
  • Probably, yes. Packages usually rely on other packages, and if those are not updated, then you can't update the others. Check if your `stringr` version is 1.5.0 – Maël Feb 16 '23 at 17:45

Why not use str_extract with a lookbehind (?<=-), a negated character class [^-] that disallows the - character and, finally, the end-of-string anchor $:

library(stringr)
str_extract(SKU, "(?<=-)[^-]+$")
# [1] "L"  "XL" "XS" "S"

To simplify (and perhaps to speed up) things, we can drop the lookbehind entirely and rely solely on the negated character class in combination with the end-of-string anchor $:

str_extract(SKU, "[^-]+$")
# [1] "L"  "XL" "XS" "S"

Here, str_extract extracts the substring that contains no - and that runs to the end of the string.
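
One difference between the two patterns shows up when a value has no "-" at all. The sketch below uses base R's regexpr() and regmatches() instead of stringr so that it runs without extra packages; the value "YYLS" is made up for the example.

```r
# "YYLS" is an added test value that contains no "-"
x <- c("PPM-UA-L", "YYL-S", "YYLS")

# With the lookbehind, a match must be preceded by "-"
with_lookbehind <- regmatches(x, regexpr("(?<=-)[^-]+$", x, perl = TRUE))

# Without it, a string containing no "-" matches in full
without_lookbehind <- regmatches(x, regexpr("[^-]+$", x))

with_lookbehind
# [1] "L" "S"
without_lookbehind
# [1] "L"    "S"    "YYLS"
```

Note that regmatches() drops elements with no match, whereas str_extract() keeps the result aligned with the input by returning NA for them. Which behaviour is preferable depends on whether a value without a delimiter should count as having a size.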

Chris Ruehlemann
  • Nice - I added this to the benchmark in my answer. Interestingly (to me at least) this is a lot faster than the only way I could think of to do an equivalent thing in base R, `regmatches(SKU, gregexpr("(?<=-)[^-]+$", SKU, perl = TRUE)) |> unlist()`. It is still not quite as fast as `sub()`, although these timings are more for my own interest than because it's important to optimise operations such as these for speed. – SamR Feb 16 '23 at 18:46
  • Thanks! What part in your code represents the delimiter "-" (dash)? Thanks again. – SilverSpringbb Feb 16 '23 at 20:39
  • The delimiter is included in the positive look-behind `(?<=-)`, which implements an instruction to match only if the *preceding* character is `-` – Chris Ruehlemann Feb 17 '23 at 07:47
  • @SamR ah, that old regmatches-cum-gregexpr combo - not surprising the `str_extract` solution from the `stringr` package is faster than that long-winded `base R` method. `stringr`, a package for all things regex which is part of the `tidyverse`, is wildly superior to any `base R` regex functions - check it out! – Chris Ruehlemann Feb 17 '23 at 07:56