I am trying to find a simple way to extract an unknown substring (could be anything) that appears between two known substrings. For example, I have a string:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME which is between STR1 and STR2 (without the white spaces).

I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match:

[1] "STR1 GET_ME STR2"

I can of course strip the known strings, to isolate the substring I need, but I think there should be a cleaner way to do it by using a correct regular expression.

Sasha
  • use [this](https://gist.github.com/MrFlick/10413321) fantastic function `regcapturedmatches(test, gregexpr('STR1 (.+?) STR2', test, perl = TRUE))` – rawr Aug 22 '16 at 19:01

4 Answers

You may use str_match with STR1 (.*?) STR2. Note that the spaces in the pattern are meaningful: to match anything between STR1 and STR2 regardless of spacing, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim whitespace from the captured value. If you have multiple occurrences, use str_match_all.

Also, if you need to match strings that span across line breaks/newlines add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 / (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
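As a minimal sketch of the two variations mentioned above (multiple occurrences and values that span line breaks), the following assumes stringr is installed; the sample strings `x` and `y` are made up for illustration and are not from the question.

```r
library(stringr)

# Illustrative input with two occurrences (not from the original question)
x <- "STR1 GET_ME STR2, filler, STR1 GET_ME2 STR2"

# str_match_all returns a list with one matrix per input string;
# column 1 is the whole match, column 2 is the first capture group
all_caps <- str_match_all(x, "STR1\\s*(.*?)\\s*STR2")[[1]][, 2]
print(all_caps)

# Illustrative input where the wanted value contains a newline
y <- "STR1 line one\nline two STR2"

# Without (?s) the dot does not match a line break, so there is no match (NA)
no_flag <- str_match(y, "STR1\\s*(.*?)\\s*STR2")[, 2]

# With (?s) the dot also matches line breaks
with_flag <- str_match(y, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]
print(no_flag)
print(with_flag)
```

Indexing `[[1]]` picks the matrix for the first (here, only) input string; with a vector of inputs you would loop over the list instead.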

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
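The regexec call above returns only the first match. One possible way to collect every occurrence in base R is sketched below; it swaps in the `\\K` and lookahead pattern quoted in the comments on this answer, used with gregexpr and perl = TRUE. The variable name `all_hits` is just for illustration.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# \\K drops "STR1" and the following whitespace from the reported match,
# and the lookahead (?=\\s*STR2) requires STR2 without consuming it,
# so each match is only the text in between
pattern_all <- "STR1\\s*\\K.*?(?=\\s*STR2)"

all_hits <- regmatches(test, gregexpr(pattern_all, test, perl = TRUE))[[1]]
print(all_hits)
```

This relies on PCRE features, so it needs perl = TRUE; the capture-group pattern used with regexec above does not.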
Kim
Wiktor Stribiżew
  • It works! What is the purpose of the question mark? It seems to work without it as well. – Sasha Aug 22 '16 at 18:50
  • The `?` here is part of a *lazy* (non-greedy) quantifier: `*?` matches as few characters as possible, while `*` matches as many as possible. So, given the input `STR1 xx STR2 zzz STR2`, the regex `STR1 .*? STR2` matches `STR1 xx STR2`, and `STR1 .* STR2` matches `STR1 xx STR2 zzz STR2`. If you expect multiple matches in your input, a lazy quantifier is a must here. Also, FYI: if the part of the string between `STR1` and `STR2` may contain newlines, you need to prepend the pattern with `(?s)`: `"(?s)STR1 (.*?) STR2"`. – Wiktor Stribiżew Aug 22 '16 at 18:51
  • @Wiktor: Can you explain why on earth `str_match` output is in a matrix? It seems so inconvenient, particularly when the only output most people ever want is `[,2]` – Nettle Aug 17 '18 at 02:29
  • @Nettle I would disagree because if anyone only wants `[,2]`, they should use a mere `regmatches(a, regexpr("STR1\\s*\\K.*?(?=\\s*STR2)", a, perl=TRUE))`. With `stringr`, it is also possible to use a pattern like `str_extract_all(a, "(?s)(?<=STR1\\s{0,1000}).*?(?=\\s*STR2)")` (though for some reason the space is still included in the match, and it is rather hacky). `str_match` is a lifesaver when you need to return all matches and captures. Also, the pattern that can be used with `str_match` is much more efficient. – Wiktor Stribiżew Aug 17 '18 at 07:02
  • @Wiktor: `regmatches/regexpr` combo chokes on an expression that's fine in stringr...so your expression `str_extract_all(a, "(?s)(?<=STR1\\s{0,1000}).*?(?=\\s*STR2)")` can't be applied as `regmatches(a,regexpr("(?s)(?<=STR1\\s{0,1000}).*?(?=\\s*STR2)", a, perl = TRUE) ) ` Why is that? – Nettle Aug 17 '18 at 16:18
  • @Nettle *stringr* regex is based on the ICU library, where lookbehind patterns may contain limiting quantifiers (like `{0,10}`); this is not allowed in PCRE patterns (base R `(g)sub` / `(g)regexpr` with `perl=TRUE`). – Wiktor Stribiżew Jan 08 '19 at 08:48
  • I have also written a general [article about extracting strings between two strings with regex](https://www.buymeacoffee.com/wstribizew/extracting-text-two-strings-regular-expressions); feel free to read it if you are working on a similar problem. – Wiktor Stribiżew Feb 06 '21 at 22:03
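The lazy versus greedy behaviour discussed in the comments above can be checked directly in base R. This is only a small sketch using the example input from that comment; the variable names `s`, `lazy`, and `greedy` are illustrative.

```r
# Input taken from the comment's example: one STR1 followed by two STR2
s <- "STR1 xx STR2 zzz STR2"

# Lazy: stops at the first " STR2" it can reach
lazy <- regmatches(s, regexpr("STR1 .*? STR2", s, perl = TRUE))

# Greedy: runs on to the last " STR2" in the string
greedy <- regmatches(s, regexpr("STR1 .* STR2", s, perl = TRUE))

print(lazy)
print(greedy)
```

When the input contains a single STR2, both forms return the same result, which is why the pattern in the question appears to work without the question mark.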

Here's another way, using base R:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME"
Ulises Rosas-Puchuri
  • Can you explain the `\\1`? – Giovanni Colitti Apr 28 '22 at 22:28
  • @GiovanniColitti `( )` components in the pattern are automatically numbered, so `\\1` is telling R to return the first `( )` component (which there is only one of). See what happens when you wrap `STR1` and `STR2` in parentheses and try `\\1`, `\\2`, and `\\3`: `gsub(".*(STR1) (.+) (STR2).*", "\\2", a)` – mikeck Nov 02 '22 at 22:07
  • For my particular case, I had to replace `(.+)` with `*(.*?)` – luchonacho Jan 19 '23 at 14:44
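The numbering of capture groups described in the comments can be tried out as below. The sketch also shows one caveat of the gsub approach: when the pattern does not match, gsub returns the input unchanged rather than an empty result. The variable names and the string "no markers here" are made up for illustration.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# Three groups: (STR1), (.+), (STR2); \\1, \\2, \\3 refer to them in order
g1 <- gsub(".*(STR1) (.+) (STR2).*", "\\1", a)
g2 <- gsub(".*(STR1) (.+) (STR2).*", "\\2", a)
g3 <- gsub(".*(STR1) (.+) (STR2).*", "\\3", a)
print(c(g1, g2, g3))

# Caveat: with no match there is nothing to replace,
# so the original string comes back as-is
unmatched <- gsub(".*STR1 (.+) STR2.*", "\\1", "no markers here")
print(unmatched)
```

If a missing match needs to be distinguishable from a real value, the regexec or str_match approaches in the first answer may be easier to work with.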

Another option is to use qdapRegex::ex_between to extract strings between left and right boundaries:

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"

It also works with multiple occurrences

a <- "anything STR1 GET_ME STR2, anything goes here, STR1 again get me STR2"

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"       "again get me"

Or multiple left and right boundaries

a <- "anything STR1 GET_ME STR2, anything goes here, STR4 again get me STR5"
qdapRegex::ex_between(a, c("STR1", "STR4"), c("STR2", "STR5"))[[1]]
#[1] "GET_ME"       "again get me"

The first capture is between "STR1" and "STR2", whereas the second is between "STR4" and "STR5".

Ronak Shah

We could use {unglue}; in that case we don't need regex at all:

library(unglue)
unglue::unglue_vec(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{}STR1 {x} STR2{}")
#> [1] "GET_ME"

{} matches anything without keeping it, and {x} captures its match (any variable name other than x could be used). The syntax "{}STR1 {x} STR2{}" is short for: "{=.*?}STR1 {x=.*?} STR2{=.*?}"

If you wanted to extract the sides too you could do:

unglue::unglue_data(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{left}, STR1 {x} STR2, {right}")
#>                  left      x              right
#> 1  anything goes here GET_ME anything goes here
moodymudskipper
  • What if we want to use variables instead of STR1 and STR2? Let's say I assign STR1 to a and STR2 to b; how can we use regex to extract the string between a and b? – Nishant Nov 18 '20 at 03:56
  • Instead of `"{left}, STR1 {x} STR2, {right}"` you could use `sprintf("{left}, %s {x} %s, {right}", a, b)`, or `paste0("{left}, ", a, " {x} ", b, ", {right}")` – moodymudskipper Nov 18 '20 at 04:32
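For the plain-regex version of the question in the comments (boundaries held in variables), a minimal base R sketch follows. The helper name `extract_between` is hypothetical, not from any package. It wraps each boundary in `\\Q...\\E` so regex metacharacters in the boundaries are treated literally, which assumes perl = TRUE and that the boundaries do not themselves contain `\E`.

```r
# Hypothetical helper: first text between `left` and `right` in each element of x
extract_between <- function(x, left, right) {
  # \\Q...\\E quotes the boundaries literally (PCRE only, hence perl = TRUE)
  pattern <- paste0("\\Q", left, "\\E\\s*(.*?)\\s*\\Q", right, "\\E")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(whole match, capture 1), or character(0) when nothing matched
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

left_marker <- "STR1"
right_marker <- "STR2"
s <- " anything goes here, STR1 GET_ME STR2, anything goes here"

res_plain <- extract_between(s, left_marker, right_marker)

# Boundaries containing metacharacters are still matched literally
res_meta <- extract_between("x (a) hello [b] y", "(a)", "[b]")

# No match yields NA rather than the original string
res_none <- extract_between("nothing to see", left_marker, right_marker)

print(c(res_plain, res_meta, res_none))
```

Returning NA for a missing match keeps the output the same length as the input vector, which is a design choice made for this sketch rather than something the answers above prescribe.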