Extract a string of words between two specific words in R

Question

I have the following string: "PRODUCT colgate good but not goodOKAY"

I want to extract all the words between PRODUCT and OKAY.

score 39 · Answer 1 · answered Feb 01 '15 at 22:45

This can be done with sub:

s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)

giving:

[1] "colgate good but not good"

No packages are needed.

Here is a visualization of the regular expression:

.*PRODUCT *(.*?) *OKAY.*

[Figure: regular expression visualization (Debuggex Demo)]
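The pattern can also be read piece by piece. The following is a minimal sketch, not part of the original answer: `extract_between` is a hypothetical helper name, and it simply assembles the same pattern and passes it to the same sub() call.

```r
# Hypothetical helper (illustrative only): builds the pattern from the
# answer out of its parts and applies the same sub() call.
extract_between <- function(s, left, right) {
  pattern <- paste0(
    ".*", left,   # anything up to and including the left marker
    " *",         # optional spaces after the left marker
    "(.*?)",      # capture group 1: the text to keep
    " *", right,  # optional spaces, then the right marker
    ".*"          # whatever follows the right marker
  )
  # Replace the whole string with the contents of capture group 1
  sub(pattern, "\\1", s)
}

s <- "PRODUCT colgate good but not goodOKAY"
extract_between(s, "PRODUCT", "OKAY")
```

The markers are pasted into the pattern as-is, so this sketch assumes they contain no regex metacharacters. If the pattern does not match, sub() returns the input string unchanged.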

G. Grothendieck

@g-grothendieck What could I do if I wanted to extract the words between "colgate good" and "good"? I'd expect " but not ", but instead it returns "" if I change the 'PRODUCT ' and 'OKAY' values straightforwardly to 'colgate good' and 'good'. – Yuri Santos Jun 16 '19 at 22:58

Gregor Thomas · Answer 2 · 2020-10-15T13:54:41.393

23

x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")

(?<=PRODUCT) -- look behind: the match must be preceded by PRODUCT.

.* -- match everything except newlines.

(?=OKAY) -- look ahead: the match must be followed by OKAY.

I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting, etc., the function names are predictable and understandable, and the arguments are in a consistent order. It saves me from needing the documentation every time.

(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
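For comparison, the same lookaround pattern can be used in base R through regexpr() and regmatches(). This is a minimal sketch rather than part of the original answer. The match keeps the space that follows PRODUCT, so the trimws() call (available in base R 3.2.0 and later) is added on the assumption that surrounding whitespace is not wanted.

```r
x <- "PRODUCT colgate good but not goodOKAY"

# perl = TRUE is required: lookbehind and lookahead are PCRE features
m <- regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE)

# regmatches() returns the matched substring, which still includes
# the space that follows "PRODUCT"
between <- regmatches(x, m)

# Assumption: leading/trailing whitespace is unwanted, so strip it
result <- trimws(between)
result
```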

edited Oct 15 '20 at 13:54

answered Feb 01 '15 at 20:30

Gregor Thomas


This doesn't work for me: 'could not find function "perl"'. Not sure if a package is missing or something. – Simon C. Oct 15 '20 at 09:46

@SimonC. the `stringr` package was updated a while ago to no longer need or use the `perl` function, so it was removed from the package. I just removed it from the answer too, so it works with the current version of `stringr`. – Gregor Thomas Oct 15 '20 at 13:53

score 17 · Answer 3 · answered Feb 01 '15 at 20:26

You can use gsub:

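vec <- "PRODUCT colgate good but not goodOKAY"

# Delete everything up to and including "PRODUCT" (plus any whitespace after it),
# and everything from "OKAY" onwards
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"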

Sven Hohenstein

score 13 · Answer 4 · edited Feb 02 '15 at 06:10

You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:

x <- "PRODUCT colgate good but not goodOKAY"

library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)

## [[1]]
## [1] "colgate good but not good"

edited Feb 02 '15 at 06:10

Gregor Thomas


answered Feb 02 '15 at 03:39

Tyler Rinker


score 3 · Answer 5 · answered Oct 08 '19 at 17:13

You could use the unglue package:

library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

moodymudskipper
