I expect the following code to recognize all four patterns (pat1, pat2, pat3, pat4). However, it recognizes only 'pat4'. How should I modify my code?

test.pat<-"pat1|pat2|pat3|((?<=abcd)(.|\n)*pat4)"

originalTXT<-"start abcd pat1 pat2 pat3 pat4"

gregexpr(test.pat, originalTXT, perl=TRUE)

[[1]]

[1] 11

attr(,"match.length")

[1] 20

attr(,"useBytes")

[1] TRUE

attr(,"capture.start")

[1,] 11 26

attr(,"capture.length")

[1,] 20 1

attr(,"capture.names")

[1] "" ""

If I omit the "pat4" alternative, which uses a positive lookbehind, there is no problem.

test.pat<-"pat1|pat2|pat3" 

originalTXT<-"start abcd pat1 pat2 pat3 pat4"

gregexpr(test.pat, originalTXT, perl=TRUE)

[[1]]

[1] 12 17 22

attr(,"match.length")

[1] 4 4 4

attr(,"useBytes")

[1] TRUE
  • You are expecting overlapping matches to be returned (pos 12, length 4; and pos 11, length 20), but the function doesn't do that. It looks for each subsequent match no earlier than the end of the previous one. – ikegami Mar 24 '18 at 04:00
  • Not sure I know what the error is but I do suspect that `\n` is unlikely to match anything useful. In R regex patterns, backslashes are almost never single. – IRTFM Mar 24 '18 at 04:01
  • @42-, That would be an issue for `\s`, but not for `\n` (since both a literal line feed and `\n` match a line feed). – ikegami Mar 24 '18 at 04:03
  • Color me surprised. All three of "\n", "\\n", and "\\\n" as patterns match a "\n" (even without perl=TRUE). – IRTFM Mar 24 '18 at 04:09
  • @42-, The last one works because you can safely escape non-word characters in a regex pattern, and a line feed is not a word character. A line feed doesn't require escaping since it has no special meaning to the regex engine, which is why the first one works. Finally, the two-character sequence produced by the middle literal is recognized by the regex engine to mean "match a line feed", so that also works. – ikegami Mar 24 '18 at 04:28
  • But I thought patterns needed to go first through the R parsing engine (which would use the first backslash as an escape char to make the second backslash a "real" backslash) before the escaped-n got sent to the regex code. – IRTFM Mar 24 '18 at 04:38
  • @ikegami, Thank you so much for your advice. I found [link](https://stackoverflow.com/questions/7878992/finding-the-indexes-of-multiple-overlapping-matching-substrings/7879329#7879329) & [link](https://stackoverflow.com/questions/31609721/r-str-count-with-overlapping-substrings), which solved the problem. – YongSeo Koo Mar 24 '18 at 04:57
  • Although this is a duplicate question, I did not delete this question because of the meaningful discussion between ikegami and 42-. If I should delete this question, please tell me. I solved this problem by using the following code: test.pat<-"pat1|pat2|pat3|(?=(?<=abcd)(.|\n)*pat4)". – YongSeo Koo Mar 24 '18 at 05:11
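
A minimal sketch of why the workaround in the last comment behaves differently, using only the string and the two patterns quoted in this thread (the variable names `pat_consuming`, `pat_lookahead`, `m_consuming`, and `m_lookahead` are made up for illustration). Wrapping the fourth alternative in `(?=...)` makes it zero-width, so it no longer consumes the text containing pat1 to pat3, and `gregexpr()` can go on to report them:

```r
# Text and patterns copied from the question and the asker's final comment.
originalTXT   <- "start abcd pat1 pat2 pat3 pat4"
pat_consuming <- "pat1|pat2|pat3|((?<=abcd)(.|\n)*pat4)"
pat_lookahead <- "pat1|pat2|pat3|(?=(?<=abcd)(.|\n)*pat4)"

# gregexpr() returns a list with one element per input string; take the first.
m_consuming <- gregexpr(pat_consuming, originalTXT, perl = TRUE)[[1]]
m_lookahead <- gregexpr(pat_lookahead, originalTXT, perl = TRUE)[[1]]

# Consuming version: one match starting right after "abcd" that swallows
# pat1..pat3 on its way to pat4, so they cannot be matched separately.
print(as.integer(m_consuming))                       # 11
print(as.integer(attr(m_consuming, "match.length"))) # 20

# Lookahead version: a zero-length match where the lookbehind holds,
# followed by the three literal alternatives.
print(as.integer(m_lookahead))                       # 11 12 17 22
print(as.integer(attr(m_lookahead, "match.length"))) # 0 4 4 4
```

Note that the zero-length match at position 11 only marks the point where "abcd" precedes and pat4 follows somewhere later; it does not report the span up to pat4, so whether this is enough depends on what is needed downstream.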