I'm playing around with websites and regular expressions in C#. I have this situation:

             <a href="path/to/image">
    <img src="thumbnail"></a>

That layout is how my application receives the content of a given web site. The tabs and line breaks are not the same for each row.

I use gskinner's RegExr (http://gskinner.com/RegExr/) to check the regex, and I have created this regular expression:

            (?i)<a([^>]+)>\W.*</a>

Flags: Multiline

RegExr shows that the pattern matches. But when I use it in C# (regEx.Matches(...)), it no longer finds any matches.

Does anyone have any clue how to do this?

Thanks

Jean-Rémy Revy
    Don't do it with Regex. See for example http://stackoverflow.com/q/590747/390819. One of the right tools for parsing HTML is http://htmlagilitypack.codeplex.com/ – Cristian Lupascu May 16 '12 at 20:42

1 Answer

Using HtmlAgilityPack and your sample string:

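// Parse the sample HTML string (held in `html`) into a document tree.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// href of the first <a> element; throws if that element has no href attribute.
// The LINQ calls require `using System.Linq;`.
var href = doc.DocumentNode
    .Descendants("a")
    .Select(n => n.Attributes["href"].Value)
    .FirstOrDefault();

// src of the first <img> element; throws if that element has no src attribute.
var src = doc.DocumentNode
    .Descendants("img")
    .Select(n => n.Attributes["src"].Value)
    .FirstOrDefault();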
L.B
  • Okay, cool. I tried HtmlAgilityPack, but when I replace FirstOrDefault() with ToList() I get an "object reference not set" exception. I want all the links on the page, not just one. How do I do this? – May 17 '12 at 06:51
  • You can add a `.Where(n => n.Attributes["someattr"] != null)` before `Select` to make sure the attribute is not null – L.B May 17 '12 at 07:00
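
A minimal sketch of the `Where` suggestion above, applied to the "all links" case from the follow-up comment. The class name `LinkExtractor` and the method `GetAttributeValues` are hypothetical helpers introduced here for illustration. They are not part of HtmlAgilityPack or of the original answer, and the sketch assumes the HtmlAgilityPack package is referenced in the project.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack or the original answer.
public static class LinkExtractor
{
    // Returns the value of `attribute` for every <tag> element in `html`
    // that actually carries that attribute.
    public static List<string> GetAttributeValues(string html, string tag, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants(tag)
            // Attributes[name] is null when the element lacks the attribute,
            // so filter first; reading .Value on null is what throws the
            // "object reference not set" exception once ToList() visits
            // every element instead of only the first.
            .Where(n => n.Attributes[attribute] != null)
            .Select(n => n.Attributes[attribute].Value)
            .ToList();
    }
}
```

Under these assumptions, `LinkExtractor.GetAttributeValues(html, "a", "href")` would collect every link target on the page, and `LinkExtractor.GetAttributeValues(html, "img", "src")` every image source. Anchors without an `href` (for example, named anchors) are skipped rather than causing an exception.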