1

What is a regex to find the first image in an image tag in an HTML document? My previous attempts have not really worked, as they just matched on .jpg" and didn't take into account edge cases such as an image URL with a cache-buster timestamp at the end (asdf.jpg?581291823).

Edit: I'm using Node.js. I'd like to do HTML parsing, but we have a lot of documents to parse, so I'm not sure if HTML parsing is the best option as it takes considerably more time.

Filo Stacks
    Use a DOM Parser instead of unreliable HTML parsing with regex. Which language are you using? Provide a sample input and output as well to get better answers. – anubhava Jul 07 '11 at 15:41

3 Answers

3

This is a perfect example of a task that is tricky and unreliable with regex, and almost trivially easy with an HTML parser. Use a parser for this, not regex.

You haven't said which language you're using, but I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.

Edit: If you must use a regex, go with @ridgerunner's pattern.

Justin Morgan - On strike
3

As anubhava correctly points out, regex is not 100% reliable for parsing HTML. However, for one-shot tasks (i.e. not production code), a regex solution can do a pretty good job (and is quite fast as well):

Capture the image URL filename (sans query or fragment) from the first IMG element into group $1:

<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)

Note that there are certainly edge cases where this does not work.

Edit: Added ">" to the negated SRC attribute value character class.
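
Since the question mentions Node.js, here is a minimal sketch of how the pattern above might be applied there. The helper name `firstImageSrc` and the sample markup are made up for illustration, and the `i` flag is my addition (so that `<IMG SRC=...>` also matches); neither is part of the original pattern.

```javascript
// The pattern from this answer, with an added case-insensitive flag (assumption).
const IMG_SRC = /<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)/i;

// Hypothetical helper: returns the src of the first IMG element
// (without query string or fragment), or null if nothing matches.
function firstImageSrc(html) {
  const match = IMG_SRC.exec(html);
  return match ? match[1] : null;
}

// Made-up sample input with a cache-buster on the first image.
const sample =
  '<p>intro</p><img class="hero" src="asdf.jpg?581291823"><img src="second.png">';

console.log(firstImageSrc(sample)); // prints: asdf.jpg
```

One concrete edge case where it goes wrong: for `<img data-src="lazy.jpg" src="real.jpg">` the pattern latches onto the `src` inside `data-src` and returns `lazy.jpg`.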

ridgerunner
  • Might want to deal with unquoted URLs as well, if you can figure out a good way to handle them. Your pattern will handle `` just fine, but `` or `` will break it. I can't think of a simple way to deal with this without breaking something else. Anyway, +1 for mentioning parsers and for a decent solution where no pattern will be perfect. – Justin Morgan - On strike Jul 07 '11 at 20:49
  • Update: Having thought about it, `<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)(?<!/)` should at least deal with that one problem. – Justin Morgan - On strike Jul 07 '11 at 20:54
  • 2
    @Justin Morgan: Good points. I've added the `>` to the char class to handle one of the cases you mention. Since the OP did not specify a language, I left out using any look behind (so this would work with Javascript). I think the `` case is highly unlikely because the empty-element-closing-slash will only be found in XHTML which will typically have quoted attribute values. Thanks for the heads up (and your attention to detail). – ridgerunner Jul 08 '11 at 02:47
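
For what it's worth, current Node.js versions do support lookbehind, so the `(?<!/)` variant from the comments can be tried directly. A rough sketch follows; the input string is an assumed example of an unquoted value followed by a self-closing slash, not something taken from the thread.

```javascript
// The answer's pattern, without and with the trailing negative lookbehind.
// Inside a JavaScript regex literal the slash has to be escaped as \/.
const WITHOUT_LOOKBEHIND = /<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)/;
const WITH_LOOKBEHIND = /<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)(?<!\/)/;

// Assumed test input: unquoted src directly followed by "/>".
const html = "<img src=foo.jpg/>";

const plain = WITHOUT_LOOKBEHIND.exec(html)[1];
const trimmed = WITH_LOOKBEHIND.exec(html)[1];

console.log(plain);   // prints: foo.jpg/
console.log(trimmed); // prints: foo.jpg
```

The lookbehind forces the capture group to give back a trailing slash, which also means a URL that legitimately ends in `/` loses that slash.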
1

For scraping HTML, a simple and very loose regex would be: /\<img.*?src="(.*?)"/

Using a real DOM parser is of course the preferred method.
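
To give a feel for how loose that pattern is, here is a small Node.js sketch. The helper name `looseImageSrc`, the sample strings, and the step that strips the query string are all my own assumptions, added because the question wants the file name without the cache-buster.

```javascript
// The loose pattern from this answer, copied as posted.
const LOOSE_IMG = /\<img.*?src="(.*?)"/;

// Hypothetical helper: returns the first double-quoted src, or null.
function looseImageSrc(html) {
  const match = LOOSE_IMG.exec(html);
  if (!match) return null;
  // Assumption: drop any ?query or #fragment after the file name.
  return match[1].split(/[?#]/)[0];
}

console.log(looseImageSrc('<img alt="x" src="asdf.jpg?581291823">')); // prints: asdf.jpg
console.log(looseImageSrc("<img src='single.png'>"));                 // prints: null
console.log(looseImageSrc('<img\n  src="multiline.png">'));           // prints: null
```

The last two calls return null because the pattern only accepts double quotes and `.` does not match a newline, so single-quoted attributes and tags split across lines are missed.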

MichaelP