I have this text:

<a href="/extend/themes/bizway">BizWay</a>

I want to use a regular expression to extract only the word BizWay from the inner text of the a tag. Note that this is just a sample a tag; BizWay can be any word.

So let's say I want a regex like:

<a href=" + '"' + "/extend/themes/WORD" + '"' + ">WORD</a>

where WORD can be any word.

EDIT:

I have tried the following regex pattern:

@"<a href=" + '"' + "/extend/themes/.*" + '"' + @">.*</a>"

But it gives me the whole line.

I'd really appreciate your help.

Alan Moore
R.Vector

4 Answers

I'd suggest using an HTML parser library for C# instead of regex (there's a long argument about it in the Stack Overflow question "RegEx match open tags except XHTML self-contained tags").

From a quick search, HtmlAgilityPack seems to be a good bet for C#. The Stack Overflow post "How to use HTML Agility pack" will help you set it up in your C# project.

loeschg

I agree wholeheartedly with loeschg. I made the mistake of ignoring this advice and used regular expressions. After about a month of tweaking my code, I ended up using HtmlAgilityPack. Parsing HTML with regular expressions is just not as straightforward as you would expect; there are too many variables.

Here is a starting point for you...

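using HtmlAgilityPack;

string rawHtml = "<a href=\"/extend/themes/bizway\">BizWay</a>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(rawHtml);
// Select every <a> element that has an href attribute.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
    foreach (HtmlNode node in linkNodes)
    {
        string word = node.InnerText; // "BizWay" for the sample above
    }
}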

To check the value of the href, you can do this inside the loop...

if (node.Attributes["href"].Value.Contains("extend/themes"))
Gene S

I suspect the problem is not the regex itself but rather your expectation of what it will do. In my experience, a regex match returns the text that matches the full pattern specified. Your expectation is that it will return only the pieces matching the wildcards. Unfortunately, that's not how a plain match works. You still need to pull the bits you are interested in out of the result.

And for parsing HTML, as loeschg mentions, you're better off with an HTML parsing library.
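
To make the difference between the full match and the interesting piece concrete, here is a minimal C# sketch. The class and method names (`WholeMatchVersusGroup`, `Inspect`) are made up for illustration, and the pattern assumes the exact double-quoted markup from the question; it is not meant as a general HTML matcher.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WholeMatchVersusGroup
{
    // Returns the full matched text and the text captured by the first group.
    public static (string Whole, string Captured) Inspect(string input)
    {
        // The parentheses mark the part of the link text we want to pull out.
        Match m = Regex.Match(input, "<a href=\"/extend/themes/[^\"]*\">([^<]*)</a>");
        return (m.Value, m.Groups[1].Value);
    }

    public static void Main()
    {
        var (whole, captured) = Inspect("<a href=\"/extend/themes/bizway\">BizWay</a>");
        Console.WriteLine(whole);    // the entire <a ...>BizWay</a> element
        Console.WriteLine(captured); // BizWay
    }
}
```

`m.Value` is what the asker is currently seeing (the whole line), while `m.Groups[1].Value` holds only the text between the tags.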

Corin

You'll want to use a group if you want just part of the line. You do so by wrapping the part you want to retrieve later in parentheses, and optionally naming it with something like:

?<name>

So:

Match m = Regex.Match(@"<a href='/extend/themes/bizway'>BizWay</a>", 
                      @"<a href='/extend/themes/(?<word1>.+)'>(?<word2>.+)</a>");
Console.WriteLine(m.Groups["word1"] + " " + m.Groups["word2"]);

Would print "bizway BizWay".
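
The pattern above uses single quotes and greedy `.+`, which works for the one-link sample. As a hedged sketch of how the same idea might be adapted to the double-quoted markup from the question, the version below swaps `.+` for negated character classes so that each match stays inside a single link. The names `ThemeLinks`, `ExtractNames`, `slug`, and `name`, as well as the second sample link, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ThemeLinks
{
    // [^"]+ stops at the closing quote and [^<]+ stops at the next tag,
    // so one match cannot spill over into a neighbouring link.
    private static readonly Regex LinkPattern = new Regex(
        "<a href=\"/extend/themes/(?<slug>[^\"]+)\">(?<name>[^<]+)</a>");

    // Returns the inner text of every matching theme link, in document order.
    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }

    public static void Main()
    {
        // The second link is invented sample data.
        string html =
            "<a href=\"/extend/themes/bizway\">BizWay</a> " +
            "<a href=\"/extend/themes/other\">Other</a>";
        Console.WriteLine(string.Join(", ", ExtractNames(html))); // BizWay, Other
    }
}
```

With a greedy `.+` in both groups, two links on the same line would be swallowed into one match, which is why the sketch restricts what each group may contain. It still only handles links written exactly in this form; anything looser is a job for an HTML parser.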

Timothy S.