regex for 3 consecutive words if there are any

Question

I am looking for a regex that extracts up to 3 consecutive words, if there are any. For example, if I have 2 strings:

"1. Stack is great and awesome"
"2. Stack"

The expected result is:

"Stack is great"
"Stack"

This answer doesn't work for me: regex: matching 3 consecutive words

My effort:

(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )

You need [`[A-Za-z]+(?:\s+[A-Za-z]+){0,2}`](https://regex101.com/r/omKcW7/1). But in order to use it correctly, you need appropriate code. Do you need a single match from any string or multiple? You seem to need a full Unicode support. — Wiktor Stribiżew, Jul 18 '18 at 15:39

Wiktor Stribiżew · Accepted Answer · 2018-07-18T16:13:23.343

You may use

> x <- c("1. Stack is great and awesome", "2. Stack")
> regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
[1] "Stack is great" "Stack"
## Or to support all Unicode letters
> y <- c("1. Stąck is great and awesome", "2. Stack")
> regmatches(y, regexpr("\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl=TRUE))
[1] "Stąck is great" "Stack"
## In some R environments, it makes sense to use a POSIX-class pattern with the default TRE engine instead:
> regmatches(y, regexpr("[[:alpha:]]+(?:[[:space:]]+[[:alpha:]]+){0,2}", y))
[1] "Stąck is great" "Stack"

See the regex demo and the online R demo and an alternative regex demo.

Note that the regex extracts the first run of one, two or three consecutive letter-only words from each string. If you need at least 2 words, replace the {0,2} limiting quantifier with {1,2}.
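A minimal sketch of the {1,2} variant, assuming the same x as above. With {1,2} the second string no longer matches, and regmatches() silently drops elements without a match, so the result would be shorter than the input. The names m and result are only illustrative; the NA pre-filling is one way to keep the output aligned with x.

```r
x <- c("1. Stack is great and awesome", "2. Stack")

# Require at least two words: one word plus 1 or 2 further words
pattern <- "[A-Za-z]+(?:\\s+[A-Za-z]+){1,2}"
m <- regexpr(pattern, x, perl = TRUE)

# regexpr() returns -1 for elements without a match; keep those as NA
result <- rep(NA_character_, length(x))
result[m != -1] <- regmatches(x, m)
result
```

Here "2. Stack" has only one word, so its slot stays NA instead of disappearing from the result.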

To extract multiple matches, use gregexpr rather than regexpr.
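For illustration, a short sketch with gregexpr, again assuming the x defined above. It returns a list with one character vector per input string, and each vector holds every non-overlapping run of up to three words.

```r
x <- c("1. Stack is great and awesome", "2. Stack")
pattern <- "[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}"

# gregexpr() finds all matches per string; regmatches() returns a list
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
all_matches
```

For the first string, the words left after "Stack is great" form a second, shorter match ("and awesome").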

Pattern details

\\p{L}+ / [A-Za-z]+ - any 1+ Unicode (or ASCII if [A-Za-z] is used) letters
(?:\\s+\\p{L}+){0,2} / (?:\\s+[A-Za-z]+){0,2} - 0, 1 or 2 consecutive occurrences of:
- \\s+ - 1+ whitespaces
- \\p{L}+ / [A-Za-z]+ - any 1+ Unicode (or ASCII if [A-Za-z] is used) letters
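Since the upper bound of the quantifier is simply the maximum number of words minus one, the pattern can be generated for any word count. The helper below is a hypothetical sketch (word_run_pattern and its arguments are not part of the answer above) and covers only the ASCII variant:

```r
# Hypothetical helper: build a pattern matching min_words..max_words
# consecutive ASCII-letter words separated by whitespace
word_run_pattern <- function(max_words = 3, min_words = 1) {
  stopifnot(min_words >= 1, max_words >= min_words)
  # The first word is matched outside the group, so both bounds shift by one
  sprintf("[A-Za-z]+(?:\\s+[A-Za-z]+){%d,%d}",
          as.integer(min_words) - 1L, as.integer(max_words) - 1L)
}

x <- c("1. Stack is great and awesome", "2. Stack")
two_words <- regmatches(x, regexpr(word_run_pattern(2), x, perl = TRUE))
two_words
```

Calling word_run_pattern(3) reproduces the pattern used in the answer, and word_run_pattern(3, 2) gives the {1,2} variant.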

Remember to pass the perl=TRUE argument when the pattern uses the \p{L} construct. If it does not work, try adding the (*UCP) PCRE verb at the very beginning of the pattern; it makes the shorthand and POSIX classes (such as \w, \s and \b) Unicode-aware.

Note that all these regexps will work with stringr::str_extract and stringr::str_extract_all:

> str_extract(x, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[a-zA-Z]+(?:\\s+[a-zA-Z]+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stack is great" "Stack"

There is no support for (*UCP) here as stringr functions are ICU regex powered, not PCRE. Unicode test:

> str_extract(y, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stąck is great" "Stack"         
> str_extract(y, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stąck is great" "Stack"

`(*UCP)` changes only `\d \D \s \S \w \W \b \B`, but `\p{L}` (and other `\p{xx}`) are always the same (with or without `(*UCP)`). You can remove it. — Casimir et Hippolyte, Jul 18 '18 at 16:02
@CasimiretHippolyte If the library was not compiled with the PCRE_UCP flag, isn't it required then? Well, those R versions I have access to now really work without `(*UCP)`, but I remember older versions that didn't. Ok, let's correct it and leave a note. — Wiktor Stribiżew, Jul 18 '18 at 16:04
The PCRE_UCP flag sets only a default behaviour, but doesn't change the `\p{xx}` classes (nor `\h` or `\v` that already contain unicode characters). — Casimir et Hippolyte, Jul 18 '18 at 16:09
@CasimiretHippolyte Ok, anyway, in my Windows R 3.4.4 (2018-03-15) neither `"\\p{L}+(?:\\s+\\p{L}+){0,2}"` nor `"(*UCP)\\p{L}+(?:\\s+\\p{L}+){0,2}"` works, only an `[[:alpha:]]` based pattern works with Unicode correctly. — Wiktor Stribiżew, Jul 18 '18 at 16:17
