I'm trying to match multiple hrefs in an HTML page, but my regex returns no matches. How can I match each href in full and split it into the two named groups (url and text)?

Sample href (one of many to match):

<a href="/string1/any string here/string2">text here</a>

My regex code:

MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?(\/string2))"">(?<text>.*?)</a>", RegexOptions.Singleline);

This works, but it also matches hrefs I'm not interested in, in addition to the ones I need:

MatchCollection m1 = Regex.Matches(result, @"<a\shref=""(?<url>(\/string1\/).*?)"">(?<text>.*?)</a>", RegexOptions.Singleline);
Rexfelis
  • so you need `any string here` – Braj Aug 15 '14 at 14:30
  • A suggestion for you: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Steve Aug 15 '14 at 14:30
  • What is your question? That's great you've got some regex but where is your code? Show the expected behavior/output and the actual behavior/output. – tnw Aug 15 '14 at 14:31
  • Sorry, this is my first time using Stack Overflow. I edited my question. – Rexfelis Aug 15 '14 at 14:32
  • @Rexfelis what's the problem with your regex? Escape the forward slash in the closing `a` tag: `(\/string1\/).*?(\/string2))"">(?<text>.*?)<\/a>` – Avinash Raj Aug 15 '14 at 14:37
  • http://htmlagilitypack.codeplex.com/ – L.B Aug 15 '14 at 14:41
  • Implementing this yourself might result in a very fragile solution. The smallest deviation in the format of the input HTML may break it. You could use a very complex regex and hope it works 99% of the time, or you could simply get the content of the href attribute and split it however you need. If the input is guaranteed to be XHTML, you can use Linq2XML to build a more robust solution, but ultimately you should just pick an existing library that has been tested and proven, as @L.B suggested. Regexes are powerful, but it's easy to shoot yourself in the foot with them. – TimothyP Aug 15 '14 at 14:46
  • Thanks for the info, I'll give HtmlAgilityPack a shot. – Rexfelis Aug 15 '14 at 15:14

2 Answers

As mentioned in the comments, use a real HTML parser like HtmlAgilityPack instead of regex:

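// Requires the HtmlAgilityPack package and `using System.Linq;`.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<a href=""/string1/any string here/string2"">text here</a>");

// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var links = doc.DocumentNode
                .SelectNodes("//a[@href]")
                .Select(a=>a.Attributes["href"].Value)
                .ToList();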

Or, without XPath:

var links = doc.DocumentNode
                .Descendants("a")
                .Where(a=>a.Attributes["href"]!=null)
                .Select(a=>a.Attributes["href"].Value)
                .ToList();
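
Once the href values are in a list, plain LINQ string checks can narrow them down to the `/string1/.../string2` shape from the question. The following is only a sketch: the sample list is made up and stands in for the `links` list above, and `prefix`, `suffix`, and `middles` are names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the hrefs extracted by the parser.
var links = new List<string>
{
    "/string1/any string here/string2",
    "/string1/something else/other",
    "/unrelated/page",
};

const string prefix = "/string1/";
const string suffix = "/string2";

// Keep only hrefs shaped like /string1/<middle>/string2 and extract <middle>.
// The length check rejects values where prefix and suffix would overlap
// (for example "/string1/string2"), which would otherwise break Substring.
var middles = links
    .Where(h => h.Length > prefix.Length + suffix.Length
             && h.StartsWith(prefix, StringComparison.Ordinal)
             && h.EndsWith(suffix, StringComparison.Ordinal))
    .Select(h => h.Substring(prefix.Length, h.Length - prefix.Length - suffix.Length))
    .ToList();

foreach (var middle in middles)
    Console.WriteLine(middle);
```

Ordinal comparison is used here on the assumption that the URL segments are case-sensitive; switch to `StringComparison.OrdinalIgnoreCase` if they are not.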
EZI
  • Wow, that was easy, and I can see that using an HTML parser could be safer in the long term. I was able to add some LINQ filters to narrow the results down to what I need. – Rexfelis Aug 15 '14 at 15:20

Use parentheses for grouping and capturing:

<a href="(\/string1\/)(.*?)(\/string2)">

Here is regex101 demo
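
For reference, here is how the groups might be read back in C#. This is a sketch rather than a drop-in fix: the input string is made up, the `middle` group name is an addition for illustration, and `.*?` has been swapped for `[^"]*?` and `[^<]*`. The swap is deliberate, because `.*?` (especially with RegexOptions.Singleline) can run past the closing quote of one link and finish the match inside a later one.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input: one link that fits the pattern and one that does not.
string html =
    "<a href=\"/string1/any string here/string2\">text here</a>" +
    "<a href=\"/string1/other/page\">skip me</a>";

// [^""]*? keeps the middle part inside a single href attribute value.
// [^<]* keeps the link text inside one <a> element (assumes no nested tags).
// Forward slashes need no escaping in .NET regex patterns.
var pattern = new Regex(
    @"<a\s+href=""(?<url>/string1/(?<middle>[^""]*?)/string2)"">(?<text>[^<]*)</a>");

MatchCollection found = pattern.Matches(html);

foreach (Match match in found)
{
    Console.WriteLine(match.Groups["url"].Value);
    Console.WriteLine(match.Groups["middle"].Value);
    Console.WriteLine(match.Groups["text"].Value);
}
```

The pattern still assumes `href` is the first and only attribute of the `a` tag, so it remains sensitive to markup variations.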


Or try a character class (character set):

<a href="(\/string1\/)([^\/]+)(\/string2)">

I don't see why you need to capture string1 and string2, since you already know them. You only need the string in between.

Try it without capturing groups.

Read more about lookahead and lookbehind zero-length assertions.

(?<=<a href="\/string1\/)[^\/]*(?=\/string2">)

Online demo
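
In C#, the lookaround version could be used roughly as follows. This is a minimal sketch with a made-up input string; `page` and `between` are illustrative names. Each match value is just the text between the two known segments, so no groups need to be read.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up input containing two links that fit the pattern.
string page =
    "<a href=\"/string1/any string here/string2\">text here</a>" +
    "<a href=\"/string1/second/string2\">more</a>";

// The lookbehind and lookahead assert the surrounding text without consuming it,
// so the match itself is only the part between /string1/ and /string2.
var between = Regex.Matches(page, @"(?<=<a href=""/string1/)[^/]*(?=/string2"">)")
    .Cast<Match>()
    .Select(match => match.Value)
    .ToList();

foreach (var value in between)
    Console.WriteLine(value);
```

Because `[^/]*` cannot cross a slash, this only finds links whose middle part is a single path segment.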

Braj