
I have a file with lots of repeated blocks like this:

<li>
<span>תמונה מאירוע</span>
<a href="images/gallerys/events/big/109.jpg"
title="תמונה מאירוע"><img
src="images/gallerys/events/thumbnails/109.jpg" alt="cars" />
</a>
</li>

I want to find pairs of image URL and thumbnail URL. My pattern is:

href='(.*)'(.*)title(.*)src='(.*?)'

The problem is that it returns the text from the first href to the last src.

Igor Jerosimić

2 Answers


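Your example contains no ' characters: its attributes use double quotes, so a pattern that looks for single quotes cannot match it. The title attribute is also on a new line, and . does not match line breaks by default, so the pattern will never match that properly. These are just two of the problems; there are many more to deal with, and it's practically impossible to get this right with regular expressions alone.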
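Both problems can be checked by running the question's pattern against the sample block. This is only a minimal sketch, assuming a JavaScript regex engine; the variable names are made up for illustration.

```javascript
// The <a>/<img> part of the question's sample: double-quoted attributes,
// with title and src each starting on a new line.
const block = [
  '<a href="images/gallerys/events/big/109.jpg"',
  'title="תמונה מאירוע"><img',
  'src="images/gallerys/events/thumbnails/109.jpg" alt="cars" />',
].join('\n');

// The question's pattern, unchanged: it looks for single quotes.
const originalResult = /href='(.*)'(.*)title(.*)src='(.*?)'/.exec(block);

// Same pattern with double quotes: it still fails, because "." stops at "\n".
const doubleQuotedResult = /href="(.*)"(.*)title(.*)src="(.*?)"/.exec(block);

// [\s\S] matches any character, line breaks included, and [^"]* keeps each
// capture inside its own pair of quotes.
const multilineResult = /href="([^"]*)"[\s\S]*?src="([^"]*)"/.exec(block);

console.log(originalResult, doubleQuotedResult);     // both null
console.log(multilineResult[1], multilineResult[2]); // big URL, thumbnail URL
```

The last pattern only handles this one layout; it is meant to show the cause of the failure, not to be a general solution.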

Whatever language you use (except perhaps the bash/sed/awk family), it will have a way to parse the HTML into a DOM tree, and with that you can easily find the nodes you need.
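As a rough illustration of the DOM approach, here is a browser-only sketch. It assumes the standard DOMParser API, which is not built into Node.js; the function name extractPairs and the selector are assumptions based on the sample markup, not something given in the question.

```javascript
// Browser-only sketch: relies on the standard DOMParser API.
// extractPairs is a hypothetical helper name.
function extractPairs(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const pairs = [];
  // In the sample, each gallery item is an <li> whose <a> wraps an <img>.
  for (const link of doc.querySelectorAll('li > a[href]')) {
    const img = link.querySelector('img[src]');
    if (img) {
      // getAttribute returns the value as written in the markup,
      // not the resolved absolute URL.
      pairs.push([link.getAttribute('href'), img.getAttribute('src')]);
    }
  }
  return pairs;
}
```

Calling extractPairs on the file's contents in a browser should give one [big, thumbnail] pair per list item, regardless of quote style or line breaks inside the tags.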

Note: as others pointed out, one of the problems is that .* is greedy, meaning it will consume as many characters as possible. If you insist on a regex, you can solve this with the non-greedy version .*? or with character-class matches like [^"']*.
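The difference is easy to see on a small input. The sketch below uses a made-up, single-line string with single quotes (so that the question's pattern can match at all); the file names in it are hypothetical.

```javascript
// Two blocks flattened onto one line so that "." can span both of them.
const html =
  "<a href='big/1.jpg' title='a'><img src='thumb/1.jpg'></a>" +
  "<a href='big/2.jpg' title='b'><img src='thumb/2.jpg'></a>";

// Greedy: the first (.*) grabs as much as it can, so the single match
// stretches from the first href to the last src.
const greedy = /href='(.*)'(.*)title(.*)src='(.*?)'/.exec(html);

// Non-greedy: every .*? stops at the first place the rest can match.
const lazyPairs = [...html.matchAll(/href='(.*?)'.*?src='(.*?)'/g)]
  .map((m) => [m[1], m[2]]);

// Character classes: [^']* can never run past the closing quote.
const classPairs = [...html.matchAll(/href='([^']*)'.*?src='([^']*)'/g)]
  .map((m) => [m[1], m[2]]);

console.log(greedy[1]);  // starts at big/1.jpg and runs into the second block
console.log(lazyPairs);  // one [big, thumb] pair per block
console.log(classPairs); // same pairs as lazyPairs
```

The character-class form is usually the safer of the two, since it states exactly which characters a URL may contain instead of relying on backtracking.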

Karoly Horvath
  • It is possible that SO wrapped the string, however, I think there is a multi-line flag for most regex engines? /m? – AndrewP Feb 21 '13 at 21:30

JavaScript implementation:

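var m,
    pairs = [],
    // [\s\S] matches any character including line breaks; the lazy
    // quantifiers (*?, +?) keep each match inside a single <li> block.
    // Group 1 captures the href value, group 2 the img src value.
    rex = /<li>[\s\S]*?<a href="([^"]+)"[\s\S]+?<img\s+src="([^"]+)"/g,
    // Sample input: the block from the question.
    str = '<li>\n' +
          '<span>תמונה מאירוע</span>\n' +
          '<a href="images/gallerys/events/big/109.jpg"\n' +
          'title="תמונה מאירוע"><img\n' +
          'src="images/gallerys/events/thumbnails/109.jpg" alt="cars" />\n' +
          '</a>\n' +
          '</li>';

// With the g flag, each exec call resumes where the previous match ended
// and returns null once there are no more matches.
while ( ( m = rex.exec( str ) ) !== null ) {
    pairs.push( [ m[1], m[2] ] );
}

// Logs an array of [image URL, thumbnail URL] pairs.
console.log( pairs );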

This assumes there are no double quotes within the URLs.

Using a proper HTML parser would be more reliable.

MikeM