I want to extract all the URLs in a string that are not between two certain characters. A URL should not be extracted if it is between the following characters:

  • " and "
  • "> and <

I have the following string:

Content <strong>http://www.helloworld.com/test</strong> with a hyperlink <a href="www.google.com">www.google.com</a> and also a normal link www.youtube.com dsdsd sometexthttp://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf

The regex I currently have is:

(http://|https://|ftp://|mailto:|www\.){1}(?![^>]*<)(?![^"]*")[^^\\\"\n\s\}\{\|\`<>~]*

It will extract:

It should also extract, but it doesn't right now:

It doesn't extract www.google.com (which is good)

https://regex101.com/r/UhVZWe/5

Khiem

1 Answer

One way of doing this would be to use the HTML Agility Pack.

Here is a sample piece of code that extracts all URLs from the inner text of all elements in your HTML fragment. You can modify it to extract only those you need:

using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Written as top-level statements (C# 9 or later); inside a class, make the method a member instead.
var content = "<strong>http://www.helloworld.com/test</strong> with a hyperlink <a href=\"www.google.com\">www.google.com</a> and also a normal link www.youtube.com dsdsd sometext http://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf";
var document = new HtmlDocument();
document.LoadHtml(content);

// Loose URL pattern: optional http(s) scheme, a dotted host name, then path/query characters.
var regex = new Regex(@"(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+", RegexOptions.Compiled);
FindMatchesInText(document.DocumentNode, regex);

// Prints every regex match found in the inner text of each node below parentNode.
static void FindMatchesInText(HtmlNode parentNode, Regex regex)
{
    foreach (var node in parentNode.ChildNodes)
    {
        var match = regex.Match(node.InnerText);
        while (match.Success)
        {
            Console.WriteLine(match.Value);
            match = match.NextMatch();
        }

        // Recurse into the children of this node.
        FindMatchesInText(node, regex);
    }
}

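Output (a URL inside an element is printed twice, once for the element itself and once for its child text node):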

http://www.helloworld.com/test

http://www.helloworld.com/test

www.google.com

www.google.com

www.youtube.com

http://www.website.com/test
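
To narrow this down to what the question asks for, one possible modification is to look only at text nodes and to skip any text that sits inside an `<a>` element. Attribute values such as `href="..."` are never text nodes, so they drop out automatically. The following is only a sketch under those assumptions: the helper name `ExtractUrlsOutsideLinks` is made up for illustration, the URL pattern is the same loose one as above, and the HtmlAgilityPack NuGet package is assumed to be installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

// Same fragment and URL pattern as in the sample above.
var content = "<strong>http://www.helloworld.com/test</strong> with a hyperlink <a href=\"www.google.com\">www.google.com</a> and also a normal link www.youtube.com dsdsd sometext http://www.website.com/test sdfsdfsdfg ssdgsdf sdfsdfsdf";
var regex = new Regex(@"(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+", RegexOptions.Compiled);

foreach (var url in ExtractUrlsOutsideLinks(content, regex))
{
    Console.WriteLine(url);
}

// Hypothetical helper (the name is illustrative, not part of any library):
// returns URL matches found in text nodes that are not inside an <a> element.
static List<string> ExtractUrlsOutsideLinks(string html, Regex urlRegex)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);

    return document.DocumentNode
        .Descendants()                                       // every node below the root, in document order
        .Where(node => node.NodeType == HtmlNodeType.Text)   // text only, so each URL is seen once
        .Where(node => !node.Ancestors("a").Any())           // skip text inside hyperlinks
        .SelectMany(node => urlRegex.Matches(node.InnerText).Cast<Match>())
        .Select(match => match.Value)
        .ToList();
}
```

With the fragment above, this should print http://www.helloworld.com/test, www.youtube.com and http://www.website.com/test, each once. If other elements need to be excluded as well, the `Ancestors("a")` filter is the place to extend.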

JuanR