Specifying a word followed by a specific word followed by max of 3 words in regex in R

Question

I'm looking for a specific regex pattern that I can't seem to get right.

Cryptically:

pattern <- "[1 word|no word][this is][1-3 words max]"

text <- c("this guy cannot get a mortgage, this is a fake application", "this is a new application", "hi this is a specific question", "this is real", "this is not what you are looking for")

str_match(text, pattern)

The output I'd like to have is:

[1] FALSE  # because there are too many words in front
[2] TRUE
[3] TRUE
[4] TRUE
[5] FALSE  # because there are too many words behind it

It should be doable, but I'm struggling with how to express words and a maximum number of them in regex. Can anyone help me with this one?

Datacrust, the StackExchange tag-recommendation system is okay, but it does provide bad suggestions occasionally. In this case, you allowed it to suggest [tag:python], which is not suggested/supported in the question. Please be more cognizant of tags being used; "more" can garner more attention and therefore more likelihood of getting an answer, but unrelated tags can invite downvotes, close-votes, and/or just negative responses. — r2evans, Dec 22 '20 at 15:43
Additionally, while it might be easy for somebody familiar with the R package ecosystem to **infer** that you are using the `stringr` package, it is not wise for you to rely on that. Please be explicit with non-base R packages. If you have not already visited them, there are a few places to read about how to format reproducible and self-contained questions *well* on SO: Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. Thanks! — r2evans, Dec 22 '20 at 15:48

r2evans · Accepted Answer · 2020-12-22T16:56:31.227

grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE  TRUE  TRUE  TRUE FALSE

This seems a little brute-force, but it allows:

^(\\S+\\s*)? zero or one word before,
the literal this is (followed by zero or more whitespace characters), then
at a minimum, \\S+ one word (at least one non-whitespace character), then
possibly space-and-a-word \\s*\\S*, twice, allowing up to three words.
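As a sketch of how those pieces fit together, the same pattern can be assembled from named parts. The variable names (before, phrase, first, extra) are only illustrative and are not part of the original answer; the resulting string is identical to the one used above.

```r
# Illustrative names only; each piece mirrors one item of the explanation.
before <- "^(\\S+\\s*)?"   # zero or one word before
phrase <- "this is\\s*"    # the literal phrase, then optional whitespace
first  <- "\\S+"           # one mandatory word
extra  <- "\\s*\\S*"       # an optional further word, used twice
pattern <- paste0(before, phrase, first, extra, extra, "$")

text <- c("this guy cannot get a mortgage, this is a fake application",
          "this is a new application",
          "hi this is a specific question",
          "this is real",
          "this is not what you are looking for")
result <- grepl(pattern, text, perl = TRUE)
result
# [1] FALSE  TRUE  TRUE  TRUE FALSE

# Every \\s* may match nothing, so a missing gap is accepted as well:
no_gap <- grepl(pattern, "this isreal", perl = TRUE)
no_gap
# [1] TRUE
```

If input like the last case can occur and should be rejected, requiring \\s+ at the places where a gap is mandatory would be one way to tighten the pattern.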

Depending on how you intend to use this, you can extract the words into a single column or into multiple columns, using strcapture (still base R):

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text, 
           proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
#                    w1
# 1                <NA>
# 2   a new application
# 3 a specific question
# 4                real
# 5                <NA>

strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text, 
           proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
#     w1       w2          w3
# 1 <NA>     <NA>        <NA>
# 2    a      new application
# 3    a specific    question
# 4 real                     
# 5 <NA>     <NA>        <NA>

The [,-1,drop=FALSE] is there because the words before "this is" have to sit in a (..) group so that they can be optional, and strcapture returns every group as a column; we don't need to keep that column, so I drop it right away. (The drop=FALSE is needed because base R data.frame subsetting reduces a single-column result to a vector by default.)
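The effect of drop=FALSE can be seen on a small hand-made data.frame. The values below are made up for illustration and only mimic the shape of the strcapture result.

```r
# Made-up data with the same two columns as the strcapture result above.
df <- data.frame(ign = c("", "hi "),
                 w1  = c("real", "a specific question"),
                 stringsAsFactors = FALSE)

dropped <- df[, -1]                 # single column collapses to a vector
kept    <- df[, -1, drop = FALSE]   # stays a one-column data.frame

class(dropped)
# [1] "character"
class(kept)
# [1] "data.frame"
names(kept)
# [1] "w1"
```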

A slight improvement (less brute-force) that allows the number of accepted words to be set programmatically:

text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
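Since only the quantifier changes between those calls, the pattern could be built by a small helper. This is a minimal sketch; the function name this_is_pattern and its arguments are hypothetical and not from the answer.

```r
# Hypothetical helper: builds the pattern for a chosen range of trailing words.
this_is_pattern <- function(max_words, min_words = 1L) {
  sprintf("^(\\S+\\s*)?this is\\s*(\\S+\\s*){%d,%d}$",
          as.integer(min_words), as.integer(max_words))
}

text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

up_to_three <- grepl(this_is_pattern(3), text2, perl = TRUE)
up_to_three
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
```

Calling this_is_pattern(4) or this_is_pattern(2) reproduces the first two patterns shown above.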

This doesn't necessarily work with strcapture, since the pattern no longer has one group per word. A repeated group keeps only its last match, so only the last of the words is captured:

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2, 
           proto = list(ign="",w1=""), perl = TRUE)
#    ign    w1
# 1        one
# 2        two
# 3      three
# 4 <NA>  <NA>
# 5 <NA>  <NA>
# 6 <NA>  <NA>
# 7 <NA>  <NA>
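One possible workaround, sketched here as an assumption rather than as part of the original answer, is to make the repeated group non-capturing, wrap it in a single outer capture group, and split the captured string afterwards. The names pat, caught and words are illustrative; the proto is given as a zero-row data.frame, which is the form documented for strcapture.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "hi this is")

# (?: ) repeats without capturing; the outer ( ) captures all words at once.
pat <- "^(\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$"

caught <- strcapture(pat, text2,
                     proto = data.frame(ign = character(),
                                        words = character(),
                                        stringsAsFactors = FALSE),
                     perl = TRUE)
caught$words
# [1] "one"           "one two"       "one two three" NA              NA

# Split each captured string into its words; non-matches stay NA.
words <- strsplit(trimws(caught$words), "\\s+")
words[[3]]
# [1] "one"   "two"   "three"
```

The result is a list with one character vector per input string, which avoids having to fix the number of columns in advance.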

Thanks r2evans for your quick response, it solves my question even though it indeed looks a little brute-force! — Datacrust, Dec 22 '20 at 16:10
