I want to detect URLs in HTML code and turn them into links. I've searched Stack Overflow, but most answers are about detecting and converting links in plain text strings. Applying those to HTML produces invalid markup: img sources get rewritten, for example.

P.S. to close voters: please read the question carefully. It is not a duplicate.

For example, line 1 needs to be converted, while lines 2 and 3 do not.

<!-- Sample html source -->
<div>
   Line 1 : https://www.google.com/
   Line 2 : <a href="https://www.google.com/">https://www.google.com/</a>
   Line 3: <img src="http://a-domain.com/lovely-image.jpg">
</div>

I need to:

  1. Find any URL in the HTML body.

  2. Check whether it is already wrapped by 'a', 'img', '!--', etc.

  3. If it is not, make it clickable by wrapping it in 'a'.

How can I do that? Either a C# or a JS solution is fine.

LATEST UPDATE: Changing the project build target from 4.7.2 to 4.5 and back to 4.7.2 fixed the "bug".

UPDATE: This is my solution, written with help from @jira. The problem is that the nodes do not change at all. The recursive function runs and replaces the links (the debugger confirms it), but the HTML document is not updated. Modifications made inside the function have no effect outside it, and I don't know why. InnerText changes, but InnerHtml does not.

var htmlVersion = "<html><head></head><body>\r\n"
   + "Some text\r\n"
   + "<div>http://google.com</div>\r\n"
   + " Then later more text: http://500px.com\r\n"
   + "<div>Sub <span>abc</span> Back text</div>\r\n"
   + "And the final text"
   + "</body></html>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlVersion);

// Linkify body
var modified = false;
var bodyNode = doc.DocumentNode.SelectSingleNode("//body"); 
var before = bodyNode.InnerHtml;
bodyNode = Linkify(bodyNode);
modified = modified || bodyNode.InnerHtml != before;
// modified is false !!!

The recursive Linkify function:

HtmlAgilityPack.HtmlNode Linkify(HtmlAgilityPack.HtmlNode node)
{
    if (node.Name == "a") // It's already a link
    {
        return node;
    }

    if (node.Name == "#text") // Do replacement here
    {

        // Create links
        // https://stackoverflow.com/a/4750468/627193
        node.InnerHtml = Regex.Replace(node.InnerHtml,
            @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)",
            "<a target='_blank' href='$1'>$1</a>");

    }

    for (int i = 0; i < node.ChildNodes.Count; i++) // Go for child nodes
    {
        node.ChildNodes[i] = Linkify(node.ChildNodes[i]);
    }
    return node;
}
Nime Cloud

2 Answers

Use an HTML parser such as HtmlAgilityPack. Select only the text nodes and search for links in them; that way you won't touch URLs that sit in attributes such as href or src. Depending on how precise you need to be, you may use a regex.

For example:

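using HtmlAgilityPack;
using System.Text.RegularExpressions;

var doc = new HtmlDocument();
doc.LoadHtml(html);
Regex r = new Regex(@"(https?://[^\s]+)");
// Text nodes only, skipping text that is already inside a link.
var textNodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");

// SelectNodes returns null (not an empty collection) when nothing matches.
if (textNodes != null) {
    foreach (var textNode in textNodes) {
        var text = textNode.GetDirectInnerText();
        var withLinks = r.Replace(text, "<a href=\"$1\">$1</a>");
        textNode.InnerHtml = withLinks;
    }
}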
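
The loop above writes markup into a text node's InnerHtml. As far as I can tell, the node then remains a text node whose raw text happens to contain tags, so later XPath queries will not see the new links as elements. If that matters, a variant that builds real a elements could look roughly like the sketch below. The Linkifier class and its Linkify method are names made up for this illustration, and the URL pattern is deliberately simple.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class Linkifier
{
    // Simple illustrative pattern; it stops at whitespace, angle brackets and quotes.
    static readonly Regex UrlPattern = new Regex(@"https?://[^\s<>""]+");

    public static void Linkify(HtmlDocument doc)
    {
        // Text nodes that are not inside <a>, <script> or <style>.
        // SelectNodes returns null when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null) return;

        foreach (var textNode in textNodes)
        {
            string text = textNode.InnerHtml; // raw text as it appears in the source
            var matches = UrlPattern.Matches(text);
            if (matches.Count == 0) continue;

            var parent = textNode.ParentNode;
            int pos = 0;
            foreach (Match m in matches)
            {
                // Plain text between the previous URL and this one.
                if (m.Index > pos)
                    parent.InsertBefore(
                        doc.CreateTextNode(text.Substring(pos, m.Index - pos)), textNode);

                // A real <a> element wrapping the URL.
                var link = doc.CreateElement("a");
                link.SetAttributeValue("href", m.Value);
                link.AppendChild(doc.CreateTextNode(m.Value));
                parent.InsertBefore(link, textNode);

                pos = m.Index + m.Length;
            }

            // Plain text after the last URL, then drop the original node.
            if (pos < text.Length)
                parent.InsertBefore(doc.CreateTextNode(text.Substring(pos)), textNode);
            parent.RemoveChild(textNode);
        }
    }
}
```

To try it, load a document with doc.LoadHtml(...), call Linkifier.Linkify(doc), and inspect doc.DocumentNode.OuterHtml. Because the new links are element nodes, a second call should find nothing left to wrap.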

A regex that matches links correctly can get quite complicated. Check the other answers here on Stack Overflow.
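
One common complication is trailing punctuation: a pattern that stops only at whitespace will swallow the full stop in "see http://google.com." A minimal sketch of one way to handle that, using Regex.Replace with a MatchEvaluator, is shown below. UrlText and WrapUrls are hypothetical names, and the set of characters trimmed is an assumption you would tune for your own content.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class UrlText
{
    // Candidate URL: scheme followed by anything up to whitespace, angle brackets or quotes.
    static readonly Regex Candidate = new Regex(@"https?://[^\s<>""]+");

    // Wraps each URL in an <a> tag, leaving trailing sentence punctuation outside the link.
    public static string WrapUrls(string text)
    {
        return Candidate.Replace(text, m =>
        {
            // Assumed trim set; ';' is left out so a trailing entity such as &amp; stays intact.
            string url = m.Value.TrimEnd('.', ',', ':', '!', '?', ')');
            string rest = m.Value.Substring(url.Length);
            return "<a href=\"" + url + "\">" + url + "</a>" + rest;
        });
    }
}
```

With this, UrlText.WrapUrls("See http://google.com.") keeps the final full stop outside the link. Trimming ')' is a trade-off: it helps with URLs written in parentheses but cuts URLs that legitimately end in one.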

jira

Changing the project build target from 4.7.2 to 4.5 and then back to 4.7.2 fixed the "bug".

Here is the working code:

var htmlVersion = "<html><head></head><body>\r\n"
   + "Some text\r\n"
   + "<div>http://google.com</div>\r\n"
   + " Then later more text: http://500px.com\r\n"
   + "<div>Sub <span>abc</span> Back text</div>\r\n"
   + "And the final text"
   + "</body></html>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlVersion);

// Linkify body
var modified = false;
var bodyNode = doc.DocumentNode.SelectSingleNode("//body"); 
var before = bodyNode.InnerHtml;
bodyNode = Linkify(bodyNode);
modified = modified || bodyNode.InnerHtml != before;

The recursive Linkify function:

HtmlAgilityPack.HtmlNode Linkify(HtmlAgilityPack.HtmlNode node)
{
    if (node == null || node.Name == "a") // Nothing to do: no node, or it's already a link
    {
        return node;
    }

    if (node.Name == "#text") // Do replacement here
    {

        // Create links
        // https://stackoverflow.com/a/4750468/627193
        node.InnerHtml = Regex.Replace(node.InnerHtml,
            @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)",
            "<a target='_blank' href='$1'>$1</a>");

    }

    for (int i = 0; i < node.ChildNodes.Count; i++) // Go for child nodes
    {
        node.ChildNodes[i] = Linkify(node.ChildNodes[i]);
    }
    return node;
}
Nime Cloud