RegEx for extracting HTML Image properties

Question

I need a RegEx pattern for extracting all the properties of an image tag.

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

I was looking at this solution https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite cover everything.

I came up with something like:

(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']

Are there any cases I'm missing, or is there a simpler, more efficient pattern?

EDIT:
Sorry, I'll be more specific: I'm doing this in .NET, so it's on the server side.
I already have a list of img tags; now I just need to parse the properties.
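To make the gaps in the pattern above concrete, here is a minimal sketch in Python (chosen only for illustration; the question itself targets .NET). It runs the pattern unchanged against a made-up tag, next to a hypothetical variant, `PAIRED`, that requires the closing quote to be the same character as the opening one. The sample tag and both variable names are assumptions, not part of the question.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"""(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']""")

# Hypothetical variant: remember the opening quote in group 2 and require
# the same character to close the value (backreference \2).
PAIRED = re.compile(r"""(alt|title|src|height|width)\s*=\s*(["'])(.*?)\2""", re.DOTALL)

# Made-up sample: an apostrophe inside a double-quoted value, plus an
# unquoted width.
tag = """<img src="a.png" alt="it's here" width=40>"""

original_hits = [m.group(0) for m in ORIGINAL.finditer(tag)]
paired_hits = {m.group(1): m.group(3) for m in PAIRED.finditer(tag)}

print(original_hits)  # the alt value is cut off at the apostrophe
print(paired_hits)    # alt is whole, but the unquoted width is still missed
```

The original pattern stops at the first quote character of either kind, so a value containing the other quote is truncated. Neither version picks up unquoted values such as `width=40`, which is one of the cases malformed HTML tends to contain.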

Ack. And again, "it depends" is the answer. You can use regex if you know beforehand *exactly* what you will be working on; you should use a parser if you can't guarantee well-formedness. — Tomalak, Dec 08 '08 at 17:46
[Beware of Zalgo](http://stackoverflow.com/a/1732454/135078) — Kelly S. French, Jan 12 '12 at 22:48

score 5 · Answer 1 · answered Dec 08 '08 at 17:35

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

It won't. Use an HTML parser if you have to parse "evil" (from an unknown source) HTML.

answered Dec 08 '08 at 17:35

Tomalak

score 1 · Answer 2 · answered Jan 03 '10 at 06:52

Your best bet is to use something like HTML Agility Pack instead of regex. It's designed to handle a lot of cases and can save you more than a few headaches from hammering out edge cases.

answered Jan 03 '10 at 06:52

James Hollingshead

score 1 · Answer 3 · answered Dec 08 '08 at 17:36

If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker to get working, less likely to have bugs (since the authors have thought of the corner cases), and it will handle the potential malformedness.
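As a rough sketch of the parser route: BeautifulSoup is a third-party package, so to stay self-contained this example swaps in Python's standard-library `html.parser` instead. The class name `ImgAttributeCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, strips the quotes,
        # and reports attributes without a value as None.
        if tag == "img":
            self.images.append(dict(attrs))


# Made-up sample mixing uppercase names, both quote styles, an unquoted
# value, a valueless attribute, and a self-closing tag.
html = """<p><IMG SRC='a.png' alt="it's here" width=40 ismap><img src="b.png" /></p>"""

collector = ImgAttributeCollector()
collector.feed(html)
collector.close()

for image in collector.images:
    print(image)
```

Self-closing tags need no extra handling here, because the default `handle_startendtag` calls `handle_starttag` itself.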

score 0 · Answer 4 · edited May 23 '17 at 12:13

Before committing yourself to regex, see what it can do: RegEx match open tags except XHTML self-contained tags

edited May 23 '17 at 12:13

Community

answered Jan 03 '10 at 08:41

ProfK

49,207
121
399
775

score 0 · Answer 5 · answered Jan 03 '10 at 08:57

/<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s>]+))+\s*\/?>/i

A match_all on this will return (the format depends on your library, but the key indexes are):

0 -> image tag
1 -> attribute
2 -> attribute name
3 -> attribute value (with enclosing quotes if present)
4 -> attribute value (without enclosing quotes if it has them, otherwise empty, use 3)
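One caveat worth checking in your own engine: in many regex flavours a repeated capture group keeps only its last iteration, so groups 1 to 4 would describe only the final attribute of each tag (.NET is an exception, since `Group.Captures` retains every iteration). A minimal Python sketch of a two-step workaround follows. The `TAG` pattern is adapted from the one above with the unquoted branch written as `[^\s>]+`, and the per-attribute pattern `ATTR` is a hypothetical addition; both are assumptions for illustration.

```python
import re

# Tag-level pattern adapted from the answer; the unquoted branch is written
# as [^\s>]+ here (an assumption) so multi-character values match.
TAG = re.compile(r"""<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s>]+))+\s*/?>""", re.I)

# Hypothetical second pattern that matches one attribute at a time.
ATTR = re.compile(r"""\s+([a-z]{3,})=(?:["']([^"']*)["']|([^\s>]+))""", re.I)

html = """<img src="a.png" alt='logo' width=40>"""

m = TAG.search(html)

# In Python's re module the repeated group holds only its last iteration.
last_name = m.group(2)

# Step two: walk the attributes of the matched tag individually.
attrs = {
    name.lower(): quoted or unquoted
    for name, quoted, unquoted in ATTR.findall(m.group(0))
}

print(last_name)
print(attrs)
```

Note that `[a-z]{3,}` skips attribute names shorter than three letters or containing hyphens or digits, such as `id` or `data-src`.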

score 0 · Answer 6 · answered Dec 08 '08 at 17:36

If you want all attribute values, might I suggest using the DOM? Something like element.attributes will work well.

If you insist on a regex, //\b\w+="[^"]+"// should get every attribute whose value is double-quoted and non-empty.
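For a feel of what that pattern does and does not pick up, here is a small Python check against a made-up tag. The sample string and the variable names are assumptions.

```python
import re

# The pattern from the answer, without the surrounding delimiters.
PATTERN = re.compile(r'\b\w+="[^"]+"')

# Made-up sample: one double-quoted value, one single-quoted value,
# one empty value, and one unquoted value.
tag = """<img src="a.png" alt='logo' title="" width=40>"""

hits = PATTERN.findall(tag)
print(hits)
```

Only the double-quoted, non-empty `src` is matched; the single-quoted, empty, and unquoted attributes are skipped.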

answered Dec 08 '08 at 17:36

sblundy
