I'm trying to match an optional (possibly present) phrase in a sentence:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*(word2)?.*(word3)/'
Output:
1:word1 2: 3:word3
I know the first '.*' is being greedy and matching everything up to 'word3'. Making it non-greedy doesn't help:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*?(word2)?.*(word3)/'
Output:
1:word1 2: 3:word3
There seems to be a conflict of interest here. I would have thought Perl would match (word2)? if possible and still satisfy the non-greedy .*?. At least that's my understanding of '?'. The Perl regex page says '?' matches one or zero times, so shouldn't it prefer one match over zero?
Even more confusing is if I capture the .*?:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'
Output:
1:word1 2: 3: 4:word3
All groups here are capturing groups, so I don't know why groups 2 and 3 are empty.
Just to make sure the inter-word space isn't being captured:
perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'
Output:
1:word1 2: 3: 4:word3
Given that the only part of the pattern not being captured is the '.*' between (word2)? and (word3), I can only assume that it's the one matching the intervening text. Sure enough:
perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4 5:$5\n" if m/(word1)(.*?)(word2)?(.*)(word3)/'
Output:
1:word1 2: 3: 4:_word2_ 5:word3
So the greedy matching is working backwards, and Perl is happy to match zero (rather than one) instance of word2. Making that last (.*) non-greedy doesn't help either.
So my question is: how can I write my regex to match and capture a possible phrase in a sentence? My examples given here are simplistic; the actual sentence I am parsing is much longer with many words between those I am matching, so I can't assume any length or composition of intervening text.
Many thanks, Scott