1

I have some texts that may contain links like this:

<a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, <a rel="nofollow" target="_blank" href="http://loremipsum.net/">http://loremipsum.net/</a> sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

I want to find the links (a tags) inside this text. What regex pattern should I use for that?

This pattern doesn't work:

const string UrlPattern = @"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";
var urlMatches = Regex.Matches(text, UrlPattern);

thanks

Sergey Berezovskiy
user3293835

3 Answers

1

I suggest using HtmlAgilityPack to parse the HTML (it's available from NuGet):

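// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
var links = nodes == null
    ? Enumerable.Empty<string>()
    : nodes.Select(a => a.Attributes["href"].Value);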

Result:

[
  "http://loremipsum.net/",
  "http://loremipsum.net/"
]

Suggested reading: Parsing Html The Cthulhu Way

Sergey Berezovskiy
0

Maybe something like this:

// Group 1 captures the opening quote (' or "); \1 requires the same closing quote.
Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");
string resultString = regexObj.Match(subjectString).Groups[2].Value;

To get a list of all matches:

StringCollection resultList = new StringCollection();

Regex regexObj = new Regex(@"<a.+?href=(['""])(.+?)\1");
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    // Groups[2] is the href value between the quotes.
    resultList.Add(matchResult.Groups[2].Value);
    matchResult = matchResult.NextMatch();
}
Vasiliy vvscode Vanchuk
0

You should use an XML parser, which is much more robust and reliable for this kind of task. But if you want something very quick and very dirty, here it is:

<a.*?<\/a>

If this is too simple, and you need to capture the link address or the link content, go with this:

<a.*?href="(?<address>.*?)".*?>(?<content>.*?)<\/a>

Neither pattern handles nested tags correctly.
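
For reference, here is a minimal sketch of how the second pattern could be used from C#, reading the named groups from each match. The class name LinkFinder and the method Find are made up for illustration, and the IgnoreCase and Singleline options are my own assumptions rather than part of the pattern. It assumes the href value is double-quoted, as in the question's sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkFinder
{
    // The named-group pattern from this answer. The options are an assumption:
    // IgnoreCase accepts <A HREF=...>, Singleline lets '.' cross line breaks.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a.*?href=""(?<address>.*?)"".*?>(?<content>.*?)<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (address, content) pair per anchor found in the text.
    public static List<KeyValuePair<string, string>> Find(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["address"].Value,
                m.Groups["content"].Value));
        }
        return result;
    }
}
```

Calling LinkFinder.Find(text) on the sample from the question should give one pair per a tag, with Key holding the href value and Value holding the link text. The same caveat applies: this is still quick and dirty, and single-quoted or unquoted href values are not matched.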

BlackBear