2

I would like to extract cat and dog in any order

string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"

What I have now extracts cat and dog, but also the text in-between

stringr::str_extract(string1, "cat.*dog|dog.*cat")

I would like the output to be

cat dog

and

dog cat

for string1 and string2, respectively

matsuo_basho
  • Are you sure it should be `dog cat` for both? I can get `cat dog` for string1 and `dog cat` for string2. Or do you want to get `dog` for string1 and `cat` for string2? – Wiktor Stribiżew Feb 02 '18 at 21:47
  • Hi Wiktor, yes, that's what I meant. Thanks for the clarification. Will edit OP accordingly – matsuo_basho Feb 02 '18 at 21:50
  • Please see my update. I have changed the function from `str_extract` to `str_extract_all` to capture all the groups. – www Feb 02 '18 at 22:00

3 Answers

3

You may use sub with the following PCRE regex:

.*(?|(dog).*(cat)|(cat).*(dog)).*

See the regex demo.

Details

  • .* - any 0+ chars other than line break chars (to match all chars add (?s) at the pattern start)
  • (?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
    • (dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
    • | - or
    • (cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1)
  • .* - any 0+ chars other than line break chars

The \1 \2 replacement pattern inserts Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and a cat or dog).

See an R demo online, too:

x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"

To return NA in case of no match, use a regex that matches either the specific pattern or the whole string, and pass it to gsubfn (from the gsubfn package, loaded with library(gsubfn)) to apply custom replacement logic:

> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"

Here,

  • ^ - start of the string anchor
  • (?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
    • .*((dog).*(cat)|(cat).*(dog)).* - the first alternative:
      • .* - any 0+ chars as many as possible
      • ((dog).*(cat)|(cat).*(dog)) - a capturing group (Group 1) matching either of the two alternatives:
        • (dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
        • | - or
        • (cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
      • .* - any 0+ chars as many as possible
    • | - or
    • .* - any 0+ chars
  • $ - end of the string anchor.

The x in the anonymous function holds the Group 1 value, which is only "technical" here: we use nchar to check whether the Group 1 match is non-empty. If it is, we build the replacement with the custom logic; if Group 1 is empty, we replace with NA.
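
Note that the gsubfn calls above return the string "NA" rather than a missing value. If a real NA is needed, one option is to test for the match first and only then run sub. The following is a minimal base R sketch of that idea; `extract_pair` is a hypothetical helper name, and it assumes the two words contain no regex metacharacters.

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "adsfadsf dog asfdadsf")

# Hypothetical helper: returns "<first> <second>" in order of appearance,
# or NA when the two words are not both present.
# Assumes `a` and `b` are plain words without regex metacharacters.
extract_pair <- function(x, a, b) {
  # Same branch reset pattern as above, built for arbitrary words
  pattern <- sprintf(".*(?|(%s).*(%s)|(%s).*(%s)).*", a, b, b, a)
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # Only rewrite the strings that actually matched
  out[hit] <- sub(pattern, "\\1 \\2", x[hit], perl = TRUE)
  out
}

extract_pair(x, "cat", "dog")
## => [1] "cat dog" "dog cat" NA
```

With "dog" and "giraffe" as the two words, all three strings above would come back as NA, since no string contains both.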

Wiktor Stribiżew
  • Wiktor, thank you very much. Alas, I found something. If my search string requirements are now "dog" and "giraffe", a test string of "aasdfadsf cat asdfadsf dog" will return "aasdfadsf cat asdfadsf dog", whereas I want it to return NA https://regex101.com/r/fnkDLg/1/ – matsuo_basho Feb 02 '18 at 23:21
  • Thank you. Can you please explain the regex syntax, especially the "(?:". Also, what is the purpose of the i argument in the function? – matsuo_basho Feb 03 '18 at 01:49
  • `strapply` might also be useful here. – G. Grothendieck Feb 03 '18 at 12:47
2

We can use str_extract_all from the stringr package with capture groups.

string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"

library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
# 
# [[2]]
# [1] "dog" "cat"
# 
# [[3]]
# character(0)

We can also set simplify = TRUE. The output would be a matrix.

str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
#       [,1]  [,2] 
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] ""    ""  
www
  • I wish this worked for what I want because it is so elegant. However, if BOTH cat and dog aren't present, I want to return NA – matsuo_basho Feb 02 '18 at 22:09
  • @matsuo_basho Please see my update. When there are no cat and dog, the function returns `character(0)` or `""` depends on `simplify = TRUE`. You may want to replace them with `NA` later. – www Feb 02 '18 at 22:18
  • I mean that I would like NA returned for the following strings: "adsfadsf dog asfdadsf", "asdfadsf cat asdfadsf" – matsuo_basho Feb 02 '18 at 22:20
  • I think what you were trying to say is “Either dog or cat is not present”. Sorry I am not an native English speaker, but I don’t think you provide sufficient information about what was your requirements. Since my output fits your desired output in your post, and you have found the answer you want, I will not modify my post. Cheers. – www Feb 02 '18 at 23:10
1

Or,

> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"

> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
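
If NA should be returned unless both words are present, the list from regmatches can be post-processed. This is a rough sketch under that assumption, using only base R; the names `strings`, `matches` and `both_or_na` are illustrative.

```r
strings <- c("aasdfadsf cat asdfadsf dog",
             "asfdadsfads dog asdfasdfadsf cat",
             "adsfadsf dog asfdadsf")

# One character vector of matches per input string
matches <- regmatches(strings, gregexpr("cat|dog", strings))

# Keep the words in order of first appearance when both are present,
# otherwise return a missing value
both_or_na <- vapply(matches, function(m) {
  if (all(c("cat", "dog") %in% m)) {
    paste(unique(m), collapse = " ")
  } else {
    NA_character_
  }
}, character(1))

both_or_na
## => [1] "cat dog" "dog cat" NA
```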
Brian Davis