9

How can I use str_match to extract the remainder of a string after the last occurrence of a substring?

For example, for the string "apples and oranges and bananas with cream", I'd like to extract the remainder of this string after the last occurrence of " and " to return "bananas with cream".

I have tried many variations of this command, but each one returns either the remainder of the string after the first " and " or an empty string.

library(stringr)

str_match("apples and oranges and bananas with cream", "(?<= and ).*(?! and )")
    
    #     [,1]                             
    #[1,] "oranges and bananas with cream"

I've searched StackOverflow for solutions and found some for JavaScript, Python and base R, but none for the stringr package.

Thanks.

Susie Derkins
James N
3 Answers

7

(I don't know about str_match, but base R regex should suffice.) Regex quantifiers are "greedy": .+ consumes as much of the string as it can and backtracks only as far as needed, so the match extends to the last "and ". It's therefore just:

sub("^.+and ", "", "apples and oranges and bananas with cream")
#[1] "bananas with cream"

I was pretty sure there would be an equivalent in the "lubridate" corner of the hadleyverse, but that attempt failed:

 library(lubridate)

Attaching package: ‘lubridate’

The following object is masked from ‘package:plyr’:

    here

The following objects are masked from ‘package:data.table’:

    hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year

The following object is masked from ‘package:base’:

    date

> str_replace("apples and oranges and bananas with cream", "^.+and ", "")
Error in str_replace("apples and oranges and bananas with cream", "^.+and ",  : 
  could not find function "str_replace"

So it's not in pkg:lubridate but rather in stringr (which as I understand it is a very light wrapper around the stringi package):

library(stringr)
 str_replace("apples and oranges and bananas with cream", "^.+and ", "")
[1] "bananas with cream"

I do wish that people who ask questions about non-base package functions would include a library call to give respondents a clue as to their working environment.
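To see the greedy behaviour this answer relies on, here is a minimal base R sketch comparing greedy and lazy quantifiers. The variable names and the second test string are made up for illustration. It also shows why a leading space in the pattern (" and ") can matter when a later word merely ends in "and".

```r
s <- "apples and oranges and bananas with cream"

# Greedy: ".+" takes as much as it can, then backtracks to the LAST "and ".
greedy <- sub("^.+and ", "", s, perl = TRUE)

# Lazy: ".+?" takes as little as it can, so it stops at the FIRST "and ".
lazy <- sub("^.+?and ", "", s, perl = TRUE)

print(greedy)  # "bananas with cream"
print(lazy)    # "oranges and bananas with cream"

# Hypothetical string where a later word ("hand") ends in "and".
t <- "salt and pepper on hand today"
no_space   <- sub("^.+and ", "", t)   # cuts after "hand ": "today"
with_space <- sub("^.+ and ", "", t)  # cuts after the word "and": "pepper on hand today"
print(no_space)
print(with_space)
```

The same patterns should carry over to stringr's str_replace, since only the greedy quantifier is doing the work.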

IRTFM
3

Another simple approach is a variation of the "skip what you want to avoid" scheme using capture groups, i.e. What_I_want_to_avoid|(What_I_want_to_match). Here the part to avoid, ^.+and, is simply followed by the capture group:

library(stringr)
s  <- "apples and oranges and bananas with cream"
str_match(s, "^.+and (.*)")[,2]

The key idea here is to completely disregard the overall matches returned by the regex engine: that's the trash bin. Instead, we only need to check capture group 1 through [,2], which, when set, contains what we are looking for. See also: http://www.rexegg.com/regex-best-trick.html#pseudoregex

We can do a similar thing using base R gsub-functions, e.g.

gsub("^.+and (.*)", "\\1", s, perl = TRUE)
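One practical difference between the two calls above shows up when an input contains no " and " at all. The following sketch, using a made-up second string and hypothetical variable names, illustrates it: str_match yields NA for the capture group, while sub returns the input unchanged.

```r
library(stringr)

# Second element is a hypothetical input with no " and " in it.
x <- c("apples and oranges and bananas with cream", "just plums")

# str_match() gives NA in the capture-group column when nothing matches.
from_match <- str_match(x, "^.+ and (.*)")[, 2]

# sub() leaves a non-matching string as it is.
from_sub <- sub("^.+ and (.*)", "\\1", x)

print(from_match)  # "bananas with cream" NA
print(from_sub)    # "bananas with cream" "just plums"
```

Which behaviour is preferable depends on whether a missing separator should be flagged (NA) or passed through.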

PS: Unfortunately, we cannot use the What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match pattern with stringi/stringr functions, since the ICU regex library they rely on does not include the (*SKIP)(*FAIL) verbs (they are only available in PCRE).

wp78de
  • FYI: base `R` does indeed support `(*SKIP)(*FAIL)` with `perl = TRUE` though I believe it's not really needed here. – Jan May 05 '18 at 06:01
  • @Jan True, but the PS addresses the situation with stringi/stringr, not R in general. – wp78de May 05 '18 at 06:06
0

If we need str_match:

library(stringr)
str_match("apples and oranges and bananas with cream",   ".*\\band\\s(.*)")[,2]
#[1] "bananas with cream"

Or there is stri_match_last from stringi:

library(stringi)
stri_match_last("apples and oranges and bananas with cream", 
         regex = ".*\\band\\s(.*)")[,2]
#[1] "bananas with cream"
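As a regex-free alternative, and only as a sketch assuming the separator is the literal string " and ", one could locate its last occurrence with stringi and take the substring after it. The variable names here are illustrative.

```r
library(stringi)

s <- "apples and oranges and bananas with cream"

# Start/end positions of the last literal " and " (NA if there is none).
loc <- stri_locate_last_fixed(s, " and ")

# Keep everything after the end of that occurrence.
rest <- stri_sub(s, from = loc[, "end"] + 1)
print(rest)  # "bananas with cream"
```

If the separator is absent, the located positions are NA and the result is NA as well, so that case may need separate handling.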
akrun