Regular expressions: Multiline-issue with html

Question

I'm playing around with websites and regular expressions in C#. I have this situation:

             <a href="path/to/image">
    <img src="thumbnail"></a>

That layout is how my application receives the content of a given website. Tabs and line breaks are not the same for each row.

I use gskinner's RegExr (http://gskinner.com/RegExr/) to check the regex, and I have created this regular expression:

            (?i)<a([^>]+)>\W.*</a>

Flags: Multiline

RegExr shows that the pattern matches. But when I use it in C# (regEx.Matches(...)), it no longer finds any matches.

Does anyone have any clue how to do this?

Thanks

Don't do it with Regex. See for example http://stackoverflow.com/q/590747/390819. One of the right tools for parsing HTML is http://htmlagilitypack.codeplex.com/ – Cristian Lupascu, May 16 '12 at 20:42
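
As background on the regex itself, one possible explanation for the difference is sketched below. It assumes the downloaded page uses Windows line endings (\r\n), which the question does not confirm. In .NET, RegexOptions.Multiline only changes how ^ and $ behave; the dot still does not match \n unless RegexOptions.Singleline is set. The class MultilineDemo and the method CountMatches are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // The pattern from the question, unchanged.
    private const string Pattern = @"(?i)<a([^>]+)>\W.*</a>";

    // Hypothetical helper: counts how often the question's pattern
    // matches the input under the given regex options.
    public static int CountMatches(string input, RegexOptions options)
    {
        return Regex.Matches(input, Pattern, options).Count;
    }
}
```

With the input "<a href=\"path/to/image\">\r\n    <img src=\"thumbnail\"></a>", calling CountMatches with RegexOptions.Multiline finds nothing, because \W consumes the \r and the dot then cannot cross the \n. Calling it with RegexOptions.Singleline finds the link. On a page with several links, a lazy `.*?` would be needed to keep one match from swallowing everything up to the last `</a>`, which is one more reason the HTML parser suggested above is the safer route.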

score 0 · Accepted Answer · answered May 16 '12 at 21:15

Using HtmlAgilityPack and your sample string:

// Requires "using System.Linq;". The variable html holds the page markup as a string.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var href = doc.DocumentNode
    .Descendants("a")
    .Select(n => n.Attributes["href"].Value)
    .FirstOrDefault();

var src = doc.DocumentNode
    .Descendants("img")
    .Select(n => n.Attributes["src"].Value)
    .FirstOrDefault();

answered May 16 '12 at 21:15

L.B

Okay, cool. I tried the HtmlAgilityPack, but when I replace FirstOrDefault() with ToList() I get an "object reference not set" exception. I want all links in the page, not just one. How do I do this? – May 17 '12 at 06:51
You can add a `.Where(n => n.Attributes["someattr"]!=null)` before `Select` to make sure the attribute is not null – L.B May 17 '12 at 07:00
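
To collect every link rather than only the first one, the filter suggested in the comment above can be combined with ToList(). This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class LinkExtractor and the method GetAttributeValues are hypothetical names chosen for illustration; they are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class LinkExtractor
{
    // Hypothetical helper: returns the value of attributeName from every
    // <tagName> element in the markup that actually carries that attribute.
    public static List<string> GetAttributeValues(string html, string tagName, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants(tagName)
            // Skip elements without the attribute: indexing a missing
            // attribute yields null, and reading .Value would then throw.
            .Where(n => n.Attributes[attributeName] != null)
            .Select(n => n.Attributes[attributeName].Value)
            .ToList();
    }
}
```

For the page in the question, `LinkExtractor.GetAttributeValues(html, "a", "href")` would return all link targets and `LinkExtractor.GetAttributeValues(html, "img", "src")` all thumbnail sources. Anchors without an href, such as `<a name="top">`, are skipped instead of causing the exception.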
