Regex - Getting the shortest text containing a given token

Question

Can this be done using only one regular expression?

Edit: Please don't complain about me parsing HTML :) The same situation can be reproduced with plain text:

Sample source string:

Lorem 1 ipsum. Lorem 2 ipsum TOKEN 
foo. Lorem 3 ipsum

Sample source string, HTML version:

<div id="entry-1">Lorem ipsum</div>
<div id="entry-2">Lorem ipsum TOKEN</div>
<div id="entry-3">Lorem ipsum</div>

What I want to get:

2, because that "Lorem ipsum" contains the token.

I'm trying it with /([0-9]+).*TOKEN/sm, but I get 1, because the match starts at the first digit it finds (the 1) and only then looks for TOKEN.

Using two separate regexes/preg_match calls it's easy, but I wonder whether this approach could be improved.

Thanks in advance for your help :)

Regexes + html = [Tony the Pony](http://stackoverflow.com/a/1732454/118068) will come and clip-clop all over your face. — Marc B, Dec 13 '11 at 15:59

score 2 · Answer 1 · answered Dec 13 '11 at 15:57

Try the non-greedy *

/entry-([0-9]+).*?TOKEN/sm

Non-greedy quantifiers aren't supported by every regex engine, but they might work in yours (is that JavaScript?)
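A minimal PHP sketch for checking this suggestion against the HTML sample from the question. It assumes PCRE via preg_match, and the variable names ($subject, $lazy, $tempered) are only illustrative. A lazy quantifier changes where a match ends, not where it begins, so on this input the lazy pattern still captures 1. The second pattern is a swapped-in technique, not part of the answer above: a negative lookahead stops the match from running across another "entry-".

```php
<?php
// Sample input taken from the question (HTML version).
$subject = <<<HTML
<div id="entry-1">Lorem ipsum</div>
<div id="entry-2">Lorem ipsum TOKEN</div>
<div id="entry-3">Lorem ipsum</div>
HTML;

// Lazy quantifier only: the match still begins at the leftmost "entry-".
preg_match('/entry-([0-9]+).*?TOKEN/s', $subject, $lazy);

// Tempered dot: a character is consumed only if no new "entry-" starts there,
// so the match cannot cross into the next entry.
preg_match('/entry-([0-9]+)(?:(?!entry-).)*?TOKEN/s', $subject, $tempered);

echo $lazy[1], "\n";     // 1
echo $tempered[1], "\n"; // 2
```

The lookahead version relies on every entry being introduced by the literal text "entry-"; for other input the boundary marker would need to be adapted.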

Patrick


As far as I understand and could test, the non-greedy behaviour only avoids matching a longer string containing two tokens: it stops after finding the first one. It's PHP, btw. Thanks – John Smith Dec 13 '11 at 16:06

score 0 · Answer 2 · answered Dec 13 '11 at 16:03

I'd use a positive lookbehind to make sure that you match TOKEN, like so:

<div id="entry-([0-9]+)">.*(?<=TOKEN)</div>

You can use it like this:

$result = preg_match('%<div id="entry-([0-9]+)">.*(?<=TOKEN)</div>%i', $subject, $matches);

This will match the second example, but not the first or third.
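For an entry whose text spans several lines, one possible variation is sketched below. It is an assumption-laden sketch rather than the pattern from this answer: it uses the s modifier, replaces the lookbehind with a negative lookahead that keeps the match inside a single div, and allows text after TOKEN. It assumes the divs are not nested and always close with a literal </div>; $subject and $pattern are illustrative names.

```php
<?php
// Question's HTML sample, with the TOKEN entry spread over two lines.
$subject = <<<HTML
<div id="entry-1">Lorem ipsum</div>
<div id="entry-2">Lorem ipsum TOKEN
foo</div>
<div id="entry-3">Lorem ipsum</div>
HTML;

// (?:(?!</div>).)  consumes one character only if no closing tag starts there,
// so neither side of TOKEN can leave the current div.
$pattern = '%<div id="entry-([0-9]+)">(?:(?!</div>).)*?TOKEN(?:(?!</div>).)*</div>%s';

preg_match($pattern, $subject, $matches);

echo $matches[1], "\n"; // 2
```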

nickb


It only works if the text containing the token is single line, even if I add the ms modifiers :( See it here: http://www.ideone.com/VyO6n – John Smith Dec 13 '11 at 16:35

score 0 · Answer 3 · answered Dec 13 '11 at 16:03

Your regex is correct; the problem is the s modifier, which makes . match newlines too, so the match starts at the 1 and runs across lines to TOKEN. Drop the s.

Also, you don't need the m modifier, as you are not using anchors in your regex.

See it

This answer assumes that the entry-[0-9]+ and the TOKEN are on the same line in the input.
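A small PHP sketch comparing the two variants on the HTML sample from the question, where each entry sits on its own line. The variable names ($subject, $withS, $withoutS) are illustrative only.

```php
<?php
// HTML sample from the question: one entry per line.
$subject = <<<HTML
<div id="entry-1">Lorem ipsum</div>
<div id="entry-2">Lorem ipsum TOKEN</div>
<div id="entry-3">Lorem ipsum</div>
HTML;

// With s, "." also matches newlines, so the match starts at the first digit.
preg_match('/([0-9]+).*TOKEN/s', $subject, $withS);

// Without s, ".*" cannot cross a line break, so the digits and TOKEN
// have to be found on the same line.
preg_match('/([0-9]+).*TOKEN/', $subject, $withoutS);

echo $withS[1], "\n";    // 1
echo $withoutS[1], "\n"; // 2
```

The same-line assumption matters: in the plain-text sample from the question, 1 and 2 share a line with TOKEN, so dropping the s would still capture 1 there.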

codaddict


I need the s and m modifiers because the text containing the TOKEN could have several lines :( Like here: http://www.ideone.com/KryNE Thanks for that link, very useful. – John Smith Dec 13 '11 at 16:13
