You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
See the regex demo.
Details
.* - any 0+ chars other than line break chars, as many as possible (to match all chars, add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group, (?|...|...), matching either of the two alternatives:
  (dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
  | - or
  (cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (inside a branch reset group, group numbering restarts at the same value in each alternative, so both alternatives fill Groups 1 and 2)
.* - any 0+ chars other than line break chars
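The remark about (?s) can be illustrated with a small sketch. The input string s below is a made-up example in which the two words sit on different lines; it is an assumption for illustration only.

```r
# Hypothetical input: cat and dog are on different lines
s <- "one cat\ntwo dog"

# Without (?s), . stops at the line break, so nothing matches
# and sub() returns the input unchanged
no_dotall <- sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", s, perl = TRUE)

# With (?s), . also matches the line break
dotall <- sub("(?s).*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", s, perl = TRUE)

print(identical(no_dotall, s))  # [1] TRUE
print(dotall)                   # [1] "cat dog"
```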
The \1 \2 replacement pattern inserts the Group 1 and Group 2 values into the resulting string, so the result is just dog or cat, a space, and then cat or dog.
See an R demo online, too:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
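For comparison, here is a sketch of what the same replacement might look like without a branch reset group. With a plain non-capturing group, the two alternatives get Groups 1-2 and 3-4, and the groups of the alternative that did not participate are empty, so the replacement has to concatenate them. The variable names no_reset and reset are only illustrative.

```r
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")

# Plain non-capturing group: alternatives use Groups 1-2 and 3-4.
# Groups of the alternative that did not match are empty,
# so \\1\\3 and \\2\\4 each yield exactly one word.
no_reset <- sub(".*(?:(dog).*(cat)|(cat).*(dog)).*", "\\1\\3 \\2\\4", x, perl = TRUE)

# Branch reset group: both alternatives share Groups 1 and 2
reset <- sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl = TRUE)

print(no_reset)                    # [1] "cat dog" "dog cat"
print(identical(no_reset, reset))  # [1] TRUE
```

The branch reset version keeps the replacement pattern shorter, which is the reason it is used in the answer.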
To return NA in case of no match, use a regex that matches either the specific pattern or the whole string, and use it with gsubfn to apply custom replacement logic:
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
^ - start of the string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
  .*((dog).*(cat)|(cat).*(dog)).*:
    .* - any 0+ chars as many as possible
    ((dog).*(cat)|(cat).*(dog)) - a capturing group (Group 1) matching either of the two alternatives:
      (dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
      | - or
      (cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
    .* - any 0+ chars as many as possible
  | - or
  .* - any 0+ chars as many as possible (the fallback that matches the whole string when the specific pattern does not)
$ - end of the string anchor.
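To see what the outer capturing group (Group 1) holds in each case, the same patterns can be tried with base R sub, wrapping \1 in square brackets. This is only a diagnostic sketch, not part of the gsubfn solution; the variable names hit and miss are illustrative.

```r
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")

# Specific pattern matches: Group 1 spans from the first word to the second
hit <- sub("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", "[\\1]", x, perl = TRUE)

# Only the .* fallback matches: Group 1 did not participate and is empty
miss <- sub("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", "[\\1]", x, perl = TRUE)

print(hit)   # [1] "[cat asdfadsf dog]"     "[dog asdfasdfadsf cat]"
print(miss)  # [1] "[]" "[]"
```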
The x argument of the anonymous function holds the Group 1 value, which is purely "technical" here: we use nchar to check whether the Group 1 match is non-empty. If it is, we replace with the custom logic; if Group 1 is empty, we replace with NA.
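Note that the gsubfn output above prints "NA" in quotes, i.e. a character string rather than a missing value. If a real NA is needed, one possible base R sketch is to test for a match first and only then substitute. The helper name extract_pair and the third input string are assumptions made for this illustration; they do not come from the original question.

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "no pets here")  # hypothetical non-matching input

# extract_pair is a hypothetical helper name, not part of any package
extract_pair <- function(x, pat = ".*(?|(dog).*(cat)|(cat).*(dog)).*") {
  matched <- grepl(pat, x, perl = TRUE)       # which elements contain both words
  out <- rep(NA_character_, length(x))        # default: missing value
  out[matched] <- sub(pat, "\\1 \\2", x[matched], perl = TRUE)
  out
}

res <- extract_pair(x)
print(res)
# [1] "cat dog" "dog cat" NA
```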