I am looking for a regex that extracts up to 3 consecutive words, if there are any. For example, given these 2 strings:

"1. Stack is great and awesome"
"2. Stack"

The expected result is:

"Stack is great"
"Stack" 

This answer doesn't work for me: regex: matching 3 consecutive words

My effort:

(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )
Mislav
  • You need [`[A-Za-z]+(?:\s+[A-Za-z]+){0,2}`](https://regex101.com/r/omKcW7/1). But in order to use it correctly, you need appropriate code. Do you need a single match from each string or multiple? You seem to need full Unicode support. – Wiktor Stribiżew Jul 18 '18 at 15:39
  • What about that other post didn't work for you? – camille Jul 18 '18 at 16:26

1 Answer

You may use

> x <- c("1. Stack is great and awesome", "2. Stack")
> regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
[1] "Stack is great" "Stack"
## Or to support all Unicode letters
> y <- c("1. Stąck is great and awesome", "2. Stack")
> regmatches(y, regexpr("\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl=TRUE))
[1] "Stąck is great" "Stack"
## In some R environments, it makes sense to use a POSIX-class pattern with the default (TRE) regex engine:
> regmatches(y, regexpr("[[:alpha:]]+(?:[[:space:]]+[[:alpha:]]+){0,2}", y))
[1] "Stąck is great" "Stack"

See the regex demo and the online R demo and an alternative regex demo.

Note that the regex extracts the first run of 1, 2 or 3 consecutive letter-only words from each string. If you need at least 2 words, replace the {0,2} limiting quantifier with {1,2}.
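As a minimal sketch of the {1,2} variant, assuming the same x as above and adding perl = TRUE (my choice here, not required by the answer): regmatches() silently drops elements that have no match, so the result can be shorter than the input. The names at_least_two, matched and aligned are illustrative only.

```r
x <- c("1. Stack is great and awesome", "2. Stack")

# {1,2} requires at least one more word after the first one.
at_least_two <- "[A-Za-z]+(?:\\s+[A-Za-z]+){1,2}"
m <- regexpr(at_least_two, x, perl = TRUE)

# Non-matching elements are dropped, so only one value comes back.
found <- regmatches(x, m)
found
# [1] "Stack is great"

# To keep the result aligned with the input, fill non-matches with NA.
matched <- as.vector(m) > 0
aligned <- rep(NA_character_, length(x))
aligned[matched] <- found
aligned
# [1] "Stack is great" NA
```

If the result has to line up with the rows of a data frame, the NA-filled version is probably the safer of the two.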

To extract multiple matches, use gregexpr rather than regexpr.
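A short sketch of the gregexpr variant, assuming the same x as above and using perl = TRUE (an assumption on my part; the answer's first example uses the default engine). With gregexpr, regmatches() returns a list with one character vector per input string. The name all_matches is illustrative.

```r
x <- c("1. Stack is great and awesome", "2. Stack")
pattern <- "[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}"

# gregexpr finds every non-overlapping match in each string.
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

all_matches[[1]]
# [1] "Stack is great" "and awesome"
all_matches[[2]]
# [1] "Stack"
```

Matches do not overlap, so the second chunk starts where the first one ended rather than at every word.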

Pattern details

  • \\p{L}+ / [A-Za-z]+ - 1 or more Unicode letters (or ASCII letters if [A-Za-z] is used)
  • (?:\\s+\\p{L}+){0,2} / (?:\\s+[A-Za-z]+){0,2} - 0, 1 or 2 consecutive occurrences of:
    • \\s+ - 1 or more whitespace characters
    • \\p{L}+ / [A-Za-z]+ - 1 or more Unicode letters (or ASCII letters if [A-Za-z] is used)

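Remember to pass the perl=TRUE argument with any regex that uses the \p{L} construct. If it still does not work, you may try adding the (*UCP) PCRE verb at the very beginning of the pattern: it makes the shorthand classes such as \d, \s, \w and \b Unicode-aware. As pointed out in the comments, \p{L} itself behaves the same with or without (*UCP), so the [[:alpha:]]-based pattern is the fallback to try.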

Note that all these regexps will work with stringr::str_extract and stringr::str_extract_all:

> str_extract(x, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[a-zA-Z]+(?:\\s+[a-zA-Z]+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stack is great" "Stack" 

There is no support for (*UCP) here, as stringr functions are powered by the ICU regex library, not PCRE. Unicode test:

> str_extract(y, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stąck is great" "Stack"
> str_extract(y, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stąck is great" "Stack"
Wiktor Stribiżew
  • @Mislav OK, I will also add that one to the current two. – Wiktor Stribiżew Jul 18 '18 at 15:48
  • `(*UCP)` changes only `\d \D \s \S \w \W \b \B`, but `\p{L}` (and other `\p{xx}`) are always the same (with or without `(*UCP)`). You can remove it. – Casimir et Hippolyte Jul 18 '18 at 16:02
  • @CasimiretHippolyte If the library was not compiled with the PCRE_UCP flag, isn't it required then? Well, those R versions I have access to now really work without `(*UCP)`, but I remember older versions that didn't. Ok, let's correct it and leave a note. – Wiktor Stribiżew Jul 18 '18 at 16:04
  • The PCRE_UCP flag sets only a default behaviour, but doesn't change the `\p{xx}` classes (nor `\h` or `\v` that already contain Unicode characters). – Casimir et Hippolyte Jul 18 '18 at 16:09
  • @CasimiretHippolyte Ok, anyway, in my Windows R 3.4.4 (2018-03-15) neither `"\\p{L}+(?:\\s+\\p{L}+){0,2}"` nor `"(*UCP)\\p{L}+(?:\\s+\\p{L}+){0,2}"` works; only an `[[:alpha:]]`-based pattern handles Unicode correctly. – Wiktor Stribiżew Jul 18 '18 at 16:17