Can't match regular expression

Question

I have a file with lots of repeated blocks like this:

<li>
<span>תמונה מאירוע</span>
<a href="images/gallerys/events/big/109.jpg"
title="תמונה מאירוע"><img
src="images/gallerys/events/thumbnails/109.jpg" alt="cars" />
</a>
</li>

I want to extract pairs of image URL and thumbnail URL. My pattern is:

href='(.*)'(.*)title(.*)src='(.*?)'

The problem is that the match returned to me spans the text from the first href to the last src.

Make your capturing groups non-greedy: `(.*)` -> `(.*?)`. Just parse the HTML. It'll be easier. — Blender, Feb 21 '13 at 21:13

Karoly Horvath · Answer 1 · 2013-02-21T21:38:37.360

There's no ' in your example: the markup uses double quotes, while your pattern looks for single quotes. The title attribute is also on a new line in this case, and . does not match a line break by default, so the pattern can't match as written. These are just a couple of examples; there are many more cases to deal with, and it's impossible to do it right with pure regexp.

Whatever language you use (except perhaps the bash/sed/awk... family) it will support parsing the HTML into a DOM tree, and with that you can easily find the needed nodes.

Note: as others pointed out, one of the problems is that .* is greedy, meaning it will eat as many characters as possible. If you're really stubborn, you can solve this with the non-greedy version .*? or with charset matches like [^"']*.
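To make the difference concrete, here is a minimal sketch in JavaScript comparing the greedy, non-greedy, and charset variants. The sample markup is simplified to a single line and uses double quotes, and the helper name `collectPairs` is made up for illustration; none of this comes from the question itself.

```javascript
// Two <li> blocks on one line, so that `.` is not stopped by line breaks.
const html =
  '<li><a href="big/1.jpg" title="a"><img src="thumbs/1.jpg" /></a></li>' +
  '<li><a href="big/2.jpg" title="b"><img src="thumbs/2.jpg" /></a></li>';

// Hypothetical helper: collect [href, src] pairs for a given global pattern.
function collectPairs(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => [m[1], m[2]]);
}

// Greedy: each .* grabs as much as it can, so one match swallows both blocks.
const greedy = /href="(.*)".*src="(.*)"/g;
// Non-greedy: each .*? stops at the first place the rest of the pattern fits.
const lazy = /href="(.*?)".*?src="(.*?)"/g;
// Charset: [^"]* cannot cross a closing quote, [^>]* cannot leave the <a> tag.
const charset = /href="([^"]*)"[^>]*>\s*<img\s+src="([^"]*)"/g;

const greedyPairs = collectPairs(greedy, html);
const lazyPairs = collectPairs(lazy, html);
const charsetPairs = collectPairs(charset, html);

console.log(greedyPairs.length, lazyPairs.length, charsetPairs.length);
```

On this input the greedy pattern yields a single match running from the first href to the last src, while the other two yield one pair per block. The charset version is stricter about what may sit between the attributes, which makes it less likely to run into a neighbouring block but also more sensitive to markup variations.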

It is possible that SO wrapped the string, however, I think there is a multi-line flag for most regex engines? /m? — AndrewP, Feb 21 '13 at 21:30
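A quick check of the flag question, as a sketch that assumes JavaScript semantics (flag names and meanings differ between engines): there, `m` only changes how ^ and $ behave, while `s` (dotAll, ES2018) is the flag that lets . match line breaks. The two-line sample string is made up for illustration.

```javascript
// An attribute pair split across two lines, as in the question's markup.
const text = 'href="big/109.jpg"\ntitle="x"';

// No flags: `.` stops at the line break, so the pattern never reaches `title`.
const plain = /href=".*".*title/.test(text);
// `m` flag: only ^ and $ change meaning; `.` still does not match "\n".
const multiline = /href=".*".*title/m.test(text);
// `s` (dotAll) flag: `.` now matches line breaks as well.
const dotAll = /href=".*".*title/s.test(text);
// [\s\S] matches any character and works without dotAll support.
const anyChar = /href="[\s\S]*?"[\s\S]*?title/.test(text);

console.log({ plain, multiline, dotAll, anyChar });
```

Only the last two variants match the sample string.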

MikeM · Accepted Answer · 2013-02-21T21:48:45.310

JavaScript implementation

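var m,
    pairs = [],
    // Group 1 captures the href of the <a>, group 2 the src of the <img>.
    // [\s\S] matches any character, including line breaks, and the lazy
    // quantifiers stop at the first href and the first img src after it.
    rex = /<li>[\s\S]*?<a href="([^"]+)"[\s\S]+?<img\s+src="([^"]+)"/g,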
    str = '<li>\n' +
          '<span>תמונה מאירוע</span>\n' +
          '<a href="images/gallerys/events/big/109.jpg"\n' +
          'title="תמונה מאירוע"><img\n' +
          'src="images/gallerys/events/thumbnails/109.jpg" alt="cars" />\n' +
          '</a>\n' +
          '</li>';

while ( m = rex.exec( str ) ) {
    pairs.push( [ m[1], m[2] ] );
}

console.log( pairs );

This assumes there are no double quotes within the URLs.

Using a proper HTML parser would be more reliable.
