grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE TRUE TRUE TRUE FALSE
This seems a little brute-force, but it allows
- `^(\\S+\\s*)?`: zero or one word before
- the literal `this is` (followed by zero or more whitespace characters), then
- at a minimum, `\\S+`: one word (at least one non-whitespace character), then
- possibly whitespace-and-a-word, `\\s*\\S*`, twice, allowing up to three words
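To make the pieces above easier to see, here is a minimal sketch that assembles the same pattern from named parts. The `sample_text` vector and the variable names are hypothetical, chosen only for illustration; they are not the `text` from the question.

```r
# Hypothetical sample input (not the original `text` vector).
sample_text <- c(
  "this is",                       # nothing after "this is"
  "this is real",                  # one word after
  "so this is a new application",  # one word before, three after
  "this is one two three four",    # four words after
  "well then this is fine"         # two words before
)

prefix  <- "^(\\S+\\s*)?"  # zero or one word before
literal <- "this is\\s*"   # the literal phrase, then optional whitespace
first   <- "\\S+"          # at least one word
extra   <- "\\s*\\S*"      # optionally whitespace and another word

# Same pattern as above: one required word plus two optional ones.
pattern <- paste0(prefix, literal, first, extra, extra, "$")

grepl(pattern, sample_text, perl = TRUE)
# [1] FALSE  TRUE  TRUE FALSE FALSE
```

Adding or removing copies of `extra` changes the maximum number of words accepted after "this is".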
Depending on how you intend to use this, you can extract the words into a single column or multiple columns using `strcapture` (still base R):
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text,
proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
#                    w1
# 1                <NA>
# 2   a new application
# 3 a specific question
# 4                real
# 5                <NA>
strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text,
proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
#     w1       w2          w3
# 1 <NA>     <NA>        <NA>
# 2    a      new application
# 3    a specific    question
# 4 real
# 5 <NA>     <NA>        <NA>
The `[,-1,drop=FALSE]` is there because we need a capture group `(..)` around the word before "this is" so that it can be made optional, but we don't need to keep it, so I drop that column right away. (The `drop=FALSE` is because base R `data.frame` subsetting defaults to reducing a single-column result to a vector.)
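A small sketch of the `drop=FALSE` behaviour, using a made-up two-column frame (`df` and its values are hypothetical):

```r
# Hypothetical frame shaped like the strcapture() result above.
df <- data.frame(ign = c("x", "y"), w1 = c("a", "b"),
                 stringsAsFactors = FALSE)

dropped <- df[, -1]                # only one column left: collapses to a vector
kept    <- df[, -1, drop = FALSE]  # stays a one-column data.frame

is.data.frame(dropped)
# [1] FALSE
is.data.frame(kept)
# [1] TRUE
```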
A slight improvement (less brute-force) that lets you set the number of accepted words programmatically:
text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
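If the upper bound comes from a variable, one way to do it is to build the pattern with `sprintf`. This is only a sketch: `up_to_n_words` is a hypothetical helper name, not something from base R.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# Hypothetical helper: TRUE where "this is" is followed by 1..n words.
up_to_n_words <- function(x, n) {
  stopifnot(length(n) == 1, n >= 1)
  # %d is filled with the upper bound of the {1,n} quantifier
  pattern <- sprintf("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,%d}$", as.integer(n))
  grepl(pattern, x, perl = TRUE)
}

up_to_n_words(text2, 4)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
up_to_n_words(text2, 2)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```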
This doesn't necessarily work with `strcapture`, since the repeated `(...)` is still a single capture group no matter how many times it repeats, and a repeated group keeps only its last repetition. Namely, it will only capture the last of the words:
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2,
proto = list(ign="",w1=""), perl = TRUE)
#    ign    w1
# 1        one
# 2        two
# 3      three
# 4 <NA>  <NA>
# 5 <NA>  <NA>
# 6 <NA>  <NA>
# 7 <NA>  <NA>
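One possible workaround, offered as a sketch rather than the only way: make the repeated group non-capturing with `(?:...)`, wrap the whole tail in a single capture group, and split it into words afterwards. The names `tail_df` and `words` are my own.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# Group 1: optional leading word (ignored). Group 2: the whole tail of 1-3 words.
tail_df <- strcapture("^(\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$", text2,
                      proto = list(ign = "", words = ""), perl = TRUE)

# Split each captured tail on whitespace; non-matches stay NA.
words <- strsplit(trimws(tail_df$words), "\\s+")
words[1:4]
```

Here `words` is a list with one character vector per input string, so the number of words no longer has to be fixed in advance.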