27

I have the following string: "PRODUCT colgate good but not goodOKAY"

I want to extract all the words between PRODUCT and OKAY.

David Arenburg
gyaanseeker

5 Answers

39

This can be done with sub:

s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)

giving:

[1] "colgate good but not good"

No packages are needed.

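Here is the regular expression:

.*PRODUCT *(.*?) *OKAY.*

The leading .*PRODUCT * consumes everything up to and including PRODUCT and any spaces after it, (.*?) lazily captures the text in between, and the trailing  *OKAY.* consumes any spaces before OKAY, OKAY itself, and the rest of the string. sub then replaces the whole match with the captured group \\1.

(Figure: regular expression visualization, from the Debuggex Demo.)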
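One thing worth keeping in mind with the sub approach: when the pattern does not match, sub returns the input unchanged instead of signalling a failure. The following is a minimal sketch, not part of the original answer, that guards against this with grepl; the second input string and the names `pattern` and `out` are made up for illustration.

```r
# Hypothetical input: the second element has no PRODUCT/OKAY markers
s <- c("PRODUCT colgate good but not goodOKAY", "no markers here")
pattern <- ".*PRODUCT *(.*?) *OKAY.*"

# Only keep the capture group where the pattern actually matched;
# otherwise return NA instead of the untouched input string
out <- ifelse(grepl(pattern, s), sub(pattern, "\\1", s), NA_character_)
print(out)
```

Whether NA is the right value for a non-match depends on the use case; an empty string would work just as well in the `ifelse` call.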

G. Grothendieck
  • @g-grothendieck What could I do if I wanted to extract the words between "colgate good" and "good"? I expected " but not ", but it returns "" if I simply change the 'PRODUCT ' and 'OKAY' values to 'colgate good' and 'good'. – Yuri Santos Jun 16 '19 at 22:58
23
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")

(?<=PRODUCT) -- look behind: the match must be immediately preceded by PRODUCT.

.* -- match everything except newlines.

(?=OKAY) -- look ahead: the match must be immediately followed by OKAY.

I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting, etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.

(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
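For comparison, the same lookaround pattern can be used in base R. This is a minimal sketch, not part of the original answer, using regexpr with perl = TRUE together with regmatches. Because the lookbehind stops right after PRODUCT, the match keeps the space that follows it, so the sketch trims the result with trimws.

```r
x <- "PRODUCT colgate good but not goodOKAY"

# perl = TRUE is needed for lookbehind/lookahead support in base R
m <- regmatches(x, regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE))

# The raw match still carries the space after PRODUCT
print(m)

# trimws() removes the surrounding whitespace
print(trimws(m))
```

Note that regmatches drops elements with no match, so for a vector input the result can be shorter than the input.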

Gregor Thomas
  • this doesn't work for me: 'could not find function "perl" ' not sure if a package is missing or something – Simon C. Oct 15 '20 at 09:46
  • 1
    @SimonC. the `stringr` package updated a bit ago to no longer need or use the `perl` function, so they removed it from the package. I just removed from the answer too so it works with the current version of `stringr`. – Gregor Thomas Oct 15 '20 at 13:53
17

You can use gsub:

vec <- "PRODUCT colgate good but not goodOKAY"

gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
Sven Hohenstein
13

You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:

x <- "PRODUCT colgate good but not goodOKAY"

library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)

## [[1]]
## [1] "colgate good but not good"
Gregor Thomas
Tyler Rinker
3

You could use the unglue package:

library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"
moodymudskipper