1

I'm trying to write a regular expression to match the src, width and height attributes on an image tag. The width and height are optional.

I have come up with the following:

(?:<img.*)(?<=src=")(?<src>([\w\s://?=&.]*)?)?(?:.*)(?<height>(?<=height=")\d*)?(?:.*)(?<width>(?<=width=")(\d*)?)?

Expresso shows this matching only the src part for the following HTML snippet:

<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>

I'm hoping I'm really close and someone here can point out what I'm doing wrong. I have a feeling it's my optional in-between-characters bit, (?:.*). I've tried making it non-greedy, with no success. Any pointers?

MJJames
  • 2
    Why do you need to use regex? Can you not run it through an HTML parsing library and use XMLReader functions instead? – duckyflip May 18 '09 at 22:11
  • 1
    Regex syntax differs between languages, so which language are you using? Perl, Ruby, something else? More importantly, consider using an HTML parser instead of a regex. Do you think a regex will match if the src comes after the width and height, rather than before? – dave4420 May 18 '09 at 22:14

4 Answers

9

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML Parser instead.

This question has been asked before and will be asked again. Regular Expressions do seem like a good choice for this problem, but they're not.
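As a rough illustration of the parser approach, here is a minimal sketch. It assumes Python and its standard-library html.parser module, since the question does not name a language; the class name ImgAttrCollector is hypothetical. It collects src, width and height from each img tag, whatever order the attributes appear in, and leaves missing ones as None.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: records src, width and height of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag != "img":
            return
        found = dict(attrs)  # attrs is a list of (name, value) pairs
        self.images.append(
            {name: found.get(name) for name in ("src", "width", "height")}
        )


# The snippet from the question.
html = '''<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>'''

collector = ImgAttrCollector()
collector.feed(html)
for image in collector.images:
    print(image)
```

The optional attributes need no special handling here: found.get(name) simply returns None when width or height is absent.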

David Webb
  • It was far easier to use an HTML parser. I used HTMLAgilityPack; it is so much faster and gives you more control. Many thanks. – MJJames May 19 '09 at 18:06
3

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

Chas. Owens
1

In most regex dialects, .* is "greedy" and will overmatch; use .*? to match "as little as possible" instead.
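To make the difference concrete, here is a small sketch using Python's re module (an assumption, since the question does not say which language is in use). It runs a greedy and a lazy pattern against the first tag from the question.

```python
import re

tag = '<img src="myimage.jpg" height="20" />'

# Greedy: .* runs to the end of the string, then backtracks to the LAST quote.
greedy = re.search(r'"(.*)"', tag).group(1)

# Lazy: .*? stops at the FIRST closing quote it can reach.
lazy = re.search(r'"(.*?)"', tag).group(1)

print(greedy)  # myimage.jpg" height="20
print(lazy)    # myimage.jpg
```

Note that a lazy quantifier only changes how much is consumed; it does not by itself make later optional groups match.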

Alex Martelli
1

I didn't have a chance to test it, but maybe this will work for you (note that I didn't use named matches):

<img(?:(\s*(src|height|width)\s*=\s*"([^"]+)"\s*)+|[^>]+?)*>
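
One way to try the same idea (attributes in any order, each optional) without a single large pattern is to work in two steps: find each img tag, then pull the attributes out of it. The sketch below assumes Python, and the names IMG_TAG, ATTR and img_attributes are hypothetical. It is still regex-based, so it remains fragile on unusual markup such as single-quoted or unquoted values.

```python
import re

# Step 1: grab each <img ...> tag as a whole.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

# Step 2: inside one tag, pick out the attributes of interest.
# The lookbehind keeps names such as data-src from matching as src.
ATTR = re.compile(r'(?<![\w-])(src|height|width)\s*=\s*"([^"]*)"', re.IGNORECASE)


def img_attributes(html):
    """Return one dict per <img> tag with whichever attributes are present."""
    results = []
    for tag in IMG_TAG.findall(html):
        results.append({name.lower(): value for name, value in ATTR.findall(tag)})
    return results


html = '''<img src="myimage.jpg" height="20" />
<img src="anotherImage.gif" width="30"/>'''

print(img_attributes(html))
```

Splitting the work this way avoids nesting quantifiers inside one another, which can make a pattern slow to fail on input that does not match.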
Jake McGraw