
I am trying to build a regex for parsing an HTML file and getting all image files. I need to do this in order to embed images before sending it as an e-mail.

Is there a "list of places" where images can be referenced? For example, I know I need to look inside <img src="here" />, or in a CSS style url('here'), or background='here', but does that cover all cases?

And does the regex already exist somewhere? I find writing regexes painful, and I don't want to miss a case, or forget to handle some broken HTML markup.

For <img> tags, I found something like this:

(?<=img\s+src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

but I don't know how to include other places.

Lou

2 Answers


Don't use regex to parse HTML; use an HTML parser such as HtmlAgilityPack instead:

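// Requires: using System.Linq;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Collect the src of every <img>; GetAttributeValue avoids a
// NullReferenceException when a tag has no src attribute.
var a = doc.DocumentNode.Descendants("img")
            .Select(x => x.GetAttributeValue("src", ""))
            .Where(src => !string.IsNullOrEmpty(src))
            .ToArray();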
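
The question also mentions `background='here'` attributes. A minimal sketch that extends the same parser-based approach to cover them, assuming the HtmlAgilityPack package is referenced. The `ImageReferences` class and `Collect` method are hypothetical names for illustration, not part of the library, and this does not claim to cover every place an image can be referenced:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ImageReferences
{
    // Hypothetical helper: returns the distinct values of <img src="...">
    // and of any background="..." attribute found in the document.
    public static List<string> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only element nodes can carry attributes; skip text and comments.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        var sources = elements
            .Where(n => n.Name == "img")
            .Select(n => n.GetAttributeValue("src", ""));

        var backgrounds = elements
            .Select(n => n.GetAttributeValue("background", ""));

        // Missing attributes come back as "", so filter those out.
        return sources.Concat(backgrounds)
            .Where(v => !string.IsNullOrWhiteSpace(v))
            .Distinct()
            .ToList();
    }
}
```

Calling `ImageReferences.Collect(html)` returns the attribute values as written in the markup; resolving relative URLs against a base address is left to the caller.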
L.B

Regex tends to be a poor choice for parsing HTML, in particular HTML from different sources.

I suggest using the HTML Agility Pack, a purpose-built HTML parser.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

An alternative is ScrapySharp, an HtmlAgilityPack extension that selects elements using CSS selectors (like jQuery).
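
An HTML parser does not look inside CSS text, so the `url('here')` references mentioned in the question need separate handling. A rough sketch of a text-only approach using only `System.Text.RegularExpressions`; the `CssUrls` class and `Extract` method are hypothetical names. It assumes simple `url(...)` values and does not handle escaped quotes, parentheses inside the URL, or CSS comments, so a real CSS parser is the more robust option:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CssUrls
{
    // Matches url(...) whose target is unquoted or wrapped in matching
    // single or double quotes. \k<q> requires the closing quote to be
    // the same character as the opening one (or absent if none opened).
    private static readonly Regex UrlPattern = new Regex(
        @"url\(\s*(?<q>['""]?)(?<url>[^'"")]*?)\k<q>\s*\)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every non-empty url(...) target found
    // in the given CSS text, in document order.
    public static List<string> Extract(string css)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(css))
        {
            var url = m.Groups["url"].Value.Trim();
            if (url.Length > 0)
            {
                urls.Add(url);
            }
        }
        return urls;
    }
}
```

The input can be the value of a `style` attribute or the text of a `<style>` element. Note that `data:` URIs are returned like any other target, so callers that only want file references would need to filter them.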

Oded
  • Thanks, but one place where it won't help is with CSS styles (`url(...)`). Should I parse this part as text only? – Lou Sep 04 '12 at 09:41
  • @Dilbert - A [CSS Parser](http://stackoverflow.com/questions/512720/is-there-a-css-parser-for-c) can be used for that part. – Oded Sep 04 '12 at 09:45