I am trying to extract the URL from an <a> tag. However, instead of getting https://website.com/-id1, I am getting the tag's link text. Here is my code:

string text = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>";
string parsed = Regex.Replace(text, " <[^>] + href =\"([^\"]+)\"[^>]*>", "$1 " );
parsed = Regex.Replace(parsed, "<[^>]+>", "");
Console.WriteLine(parsed);

The result I got was MyLink, which is not what I want. I want something like:

https://website.com/-id1

Any help or a link will be highly appreciated.

2 Answers

Regular expressions can be used with HTML only in very specific, simple cases. For example, if the text contains only a single tag, you can use "href\\s*=\\s*\"(?<url>.*?)\"" to extract the URL, e.g.:

var url=Regex.Match(text,"href\\s*=\\s*\"(?<url>.*?)\"").Groups["url"].Value;

This pattern will return:

https://website.com/-id1

This regex doesn't do anything fancy. It looks for href= with possible whitespace and then captures anything between the first double quote and the next in a non-greedy manner (.*?). This is captured in the named group url.

Anything fancier and things get very complex. For example, supporting both single and double quotes would require special handling to avoid starting on a single quote and ending on a double quote. The string could also contain multiple <a> tags that use both types of quotes.
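To illustrate the quoting problem mentioned above, here is a minimal sketch that accepts either quote style by capturing the opening quote and requiring the same character to close the value with a backreference. The class and method names (HrefExtractor, TryGetHref) are made up for this example, and it still assumes simple, well-formed input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // (["']) captures the opening quote as group 1; \1 requires the same
    // character to close the value, so "..." and '...' both work but a
    // mixed pair such as "...' does not.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*([\"'])(?<url>.*?)\\1", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and sets url when a quoted href is found.
    public static bool TryGetHref(string text, out string url)
    {
        Match m = HrefPattern.Match(text);
        url = m.Success ? m.Groups["url"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a style=\"font-weight: bold;\" href=\"https://website.com/-id1\">MyLink</a>",
            "<a style='font-weight: bold;' href='https://website.com/-id1'>MyLink</a>"
        };

        foreach (string html in samples)
        {
            if (TryGetHref(html, out string url))
            {
                Console.WriteLine(url); // prints https://website.com/-id1 for both samples
            }
        }
    }
}
```

Checking Match.Success matters here: on a failed match, Groups["url"].Value is just an empty string, which is easy to mistake for an empty href.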

For complex parsing, it would be better to use a library like AngleSharp or HtmlAgilityPack.
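As a rough sketch of the library route, this is how the same extraction might look with HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is installed; the LinkExtractor class and GetHrefs method are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the base class library

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls;
        }

        foreach (HtmlNode anchor in anchors)
        {
            urls.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<a style=\"font-weight: bold;\" href=\"https://website.com/-id1\">MyLink</a>" +
            "<a style='font-weight: bold;' href='https://website.com/-id2'>MyLink2</a>";

        foreach (string url in GetHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

The parser deals with quote styles, attribute order, and multiple tags, so none of that has to be encoded in a pattern.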

Panagiotis Kanavos
  • I want to second the recommendation of [HtmlAgilityPack](https://www.nuget.org/packages/HtmlAgilityPack). As [the most famous SO answer of all time](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) will tell you, mixing regular expressions and HTML is generally a bad idea. Stick to parsing libraries that give you something object-oriented to play around with. – pymaxion Feb 08 '17 at 16:52
  • Thank you @Panagiotis; both for the answer and good insight. I will look into them. –  Feb 08 '17 at 20:16

Try this:

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a><a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";
var r = new Regex("<a.*?href=\"(.*?)\".*?>");
var urls = new List<string>();
// Declaring the loop variable as Match avoids the "item as Match" cast.
foreach (Match item in r.Matches(input)) {
    urls.Add(item.Groups[1].Value); // group 1 holds the href value
}

It finds all <a> tags, extracts their href values, and stores them in the urls list.

Explanation

<a matches the beginning of an <a> tag
.*?href= matches anything up to href=
"(.*?)" matches and captures anything inside the double quotes
.*?> matches the rest of the opening <a> tag, up to the closing >

Maciej Kozieja
  • Thanks. A nice one with good insight. Could you please upvote my question so that I can upvote the answers? Right now I have only 13 reputation and need 2 more. Thanks in advance. –  Feb 13 '17 at 06:50
  • Thanks, highly appreciated :) –  Feb 13 '17 at 15:39