You may use
> x <- c("1. Stack is great and awesome", "2. Stack")
> regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
[1] "Stack is great" "Stack"
## Or to support all Unicode letters
> y <- c("1. Stąck is great and awesome", "2. Stack")
> regmatches(y, regexpr("\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl=TRUE))
[1] "Stąck is great" "Stack"
## In some R environments, it makes sense to use the default TRE regex engine with POSIX character classes instead:
> regmatches(y, regexpr("[[:alpha:]]+(?:[[:space:]]+[[:alpha:]]+){0,2}", y))
[1] "Stąck is great" "Stack"
See the regex demo, the online R demo, and an alternative regex demo.
Note that the regex extracts the first run of one, two or three consecutive words made of letters from each string. If you need at least two words, replace the {0,2} limiting quantifier with {1,2}.
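As a minimal sketch of the {1,2} variant, reusing the x vector from above (the name at_least_two is only illustrative): regmatches drops elements that have no match, so the one-word string does not appear in the result.

```r
# Same sample vector as at the top of the answer
x <- c("1. Stack is great and awesome", "2. Stack")

# {1,2} requires the first word to be followed by one or two more words
at_least_two <- regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){1,2}", x))
print(at_least_two)
# [1] "Stack is great"
```

If you need to keep one result per input element, check the positions returned by regexpr for -1 before calling regmatches.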
To extract multiple matches, use gregexpr rather than regexpr.
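A short sketch of the gregexpr variant with the same ASCII pattern (all_chunks is an illustrative name). Here regmatches returns a list with one character vector per input string:

```r
# Same sample vector as at the top of the answer
x <- c("1. Stack is great and awesome", "2. Stack")

# gregexpr() finds every non-overlapping match in each string
all_chunks <- regmatches(x, gregexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
print(all_chunks)
# [[1]]
# [1] "Stack is great" "and awesome"
#
# [[2]]
# [1] "Stack"
```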
Pattern details
\\p{L}+ / [A-Za-z]+ - one or more Unicode letters (or ASCII letters if [A-Za-z] is used)
(?:\\s+\\p{L}+){0,2} / (?:\\s+[a-zA-Z]+){0,2} - 0, 1 or 2 consecutive occurrences of:
  \\s+ - one or more whitespace characters
  \\p{L}+ / [A-Za-z]+ - one or more Unicode letters (or ASCII letters if [A-Za-z] is used)
Make sure to pass the perl=TRUE argument with any regex that uses the \p{L} construct. If it does not work, try adding the (*UCP) PCRE verb at the very beginning of the pattern; it makes all generic, Unicode and shorthand character classes fully Unicode aware.
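For illustration, this is roughly where the (*UCP) verb goes. This is a sketch only: whether the verb is needed at all depends on your R build and locale, and ucp_match is an illustrative name.

```r
# The non-ASCII letter is written as an escape so the string is UTF-8
# regardless of how this file is encoded
y <- c("1. St\u0105ck is great and awesome", "2. Stack")

# (*UCP) must be the very first thing in the pattern; perl = TRUE is required
ucp_match <- regmatches(y, regexpr("(*UCP)\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl = TRUE))
print(ucp_match)
```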
Note that all these regexps will work with stringr::str_extract and stringr::str_extract_all:
> library(stringr)
> str_extract(x, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stack is great" "Stack"
> str_extract(x, "[a-zA-Z]+(?:\\s+[a-zA-Z]+){0,2}")
[1] "Stack is great" "Stack"
> str_extract(x, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stack is great" "Stack"
There is no support for (*UCP) here, as stringr functions are powered by the ICU regex library, not PCRE. Unicode test:
> str_extract(y, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stąck is great" "Stack"
> str_extract(y, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stąck is great" "Stack"