I'm looking for a clean way to do the following:

I have an article that contains HTML tags such as anchors and paragraphs.
I also have a keyword that I need to find in the article and turn into an anchor (I already have the URL for it).
If the keyword exists in the article, it must meet the following TWO conditions BEFORE it is made an anchor:

  1. It cannot be inside any tag. For example, something like

    <img alt="keyword"> 
    

    will not be valid/matched.

  2. The keyword can't already be inside an anchor. For example, something like

    <a>keyword</a>
    

    will not be valid/matched.


Any help would be appreciated. Thanks

YanivHer
  • Please explain where you're trying to create this limitation. In a JavaScript function? In a web framework? – isherwood Jan 29 '13 at 14:54
  • I'm trying to do this in C# – YanivHer Jan 29 '13 at 15:07
  • No, it was my bad. I've added that to the title after you mentioned it :) – YanivHer Jan 29 '13 at 15:11
  • I'm unclear on what you're trying to do. Just trying to clarify. You need to put in a link, but you can't use an `<a>` tag? – EJC Jan 29 '13 at 15:34
  • No, I need to put a link around some existing word in the article. But I have to make sure that I won't make that word a link when it's part of an attribute of some element or is already a link (inside an anchor element). Hope that clarifies things a little more. – YanivHer Jan 29 '13 at 15:55
  • I have arranged and changed the question above to make it more clear. Hope it helps the helpers. – YanivHer Jan 29 '13 at 16:05
  • Ah, I see. Hmm that's a tough one... Where is the article coming from? User input? You could create your own markup, sort of like this site uses. Or you can look at the previous and next char in the string and make sure each is a blank space, a period or a semicolon. (Use a regex.) But that seems a little fragile. Hmmm... – EJC Jan 29 '13 at 16:23
  • My first thought was to do it like a seeker. Like when you load a file, go char by char, look for '<' and check if the next one is 'a'; if it is, look ahead for its closing tag. But I thought, why not ask for a better solution first :) By the way, the article is user input. It comes from a rich text control. – YanivHer Jan 29 '13 at 16:34
  • Oh I see what you're doing. You're linking words in the user's input based on some criteria YOU have. I'm not sure of a better way. I'm trying to think, I don't think you'd want to seek through every char, but I'm not sure what else to do. – EJC Jan 29 '13 at 16:56

1 Answer

I have managed to get it done!

Many thanks to this post, which helped me a lot with the XPath expression: http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/beae72d6-844f-4a9b-ad56-82869d685037/

My task was to add X keywords to the article, using a table of keywords and URLs in my database.
Once a keyword is matched, the code does not search for it again; it moves on to the next keyword.
A 'keyword' can consist of more than one word, which is why I added the Replace(" ", "\s+").
I also had to give precedence to the longest keywords. That is, if I have
"good day" and "good" as two different keywords, "good day" always wins.
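
The two rules above (longest keyword first, any whitespace between the words of a keyword) can be sketched on their own, without the database or HtmlAgilityPack. This is only a minimal illustration: the class and method names are made up for the example, Regex.Escape is an addition that the original code does not use, and the `\w*` prefix from the solution is left out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original solution.
public static class KeywordPatternSketch
{
    // Builds a whole-word pattern in which any run of whitespace
    // may separate the words of a multi-word keyword.
    public static string BuildPattern(string keyword)
    {
        string[] words = keyword.Trim()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Escape each word so regex metacharacters in a keyword are taken literally.
        return @"\b" + string.Join(@"\s+", words.Select(w => Regex.Escape(w))) + @"\b";
    }

    public static void Main()
    {
        string[] keywords = { "good", "good day" };

        // Longest keyword first, so "good day" is tried before "good".
        foreach (string keyword in keywords.OrderByDescending(k => k.Length))
        {
            Console.WriteLine(keyword + " -> " + BuildPattern(keyword));
        }
    }
}
```

Escaping each word separately and then joining with `\s+` avoids a pitfall: Regex.Escape also escapes spaces, so escaping the whole keyword first would break a later Replace(" ", "\s+").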

This is my solution:

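using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The wrapper class name is a placeholder; the original post did not show it.
public static class ArticleLinker
{
    public static string AddLinksToArticle(string article, int linksToAdd)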
    {
        try
        {
            //load keywords and urls
            var dt = new DAL().GetArticleLinks();

            // sort the keywords, longest first
            IEnumerable<ArticlesRow> sortedArticles = dt.OrderBy(row => row.keyword, new StringLengthComparer());

            // iterate the sorted keywords, replacing each one with an anchor
            foreach (var item in sortedArticles)
            {
                article = FindAndReplaceKeywordWithAnchor(article, item.keyword, item.url, ref linksToAdd);
                if (linksToAdd == 0)
                {
                    break;
                }
            }

            return article;
        }
        catch (Exception ex)
        {
            Utils.LogErrorAdmin(ex);
            return null;
        }
    }

    private static string FindAndReplaceKeywordWithAnchor(string article, string keyword, string url, ref int linksToAdd)
    {
        //convert text to html
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(article);

        // \w* - means the match can start with any alphanumeric characters
        // \s+ - replaces each space between words (when the keyword has more than one word)
        // \b - sets word boundaries around the keyword
        string pattern = @"\b" + keyword.Trim().Insert(0, "\\w*").Replace(" ", "\\s+") + @"\b";

        // get all text nodes except those inside an anchor element
        var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlAgilityPack.HtmlNodeCollection(null);
        foreach (var node in nodes)
        {
            if (node.InnerHtml.Contains(keyword))
            {
                Regex regex = new Regex(pattern);
                node.InnerHtml = regex.Replace(node.InnerHtml, "<a href=\"" + url + "\">" + keyword + "</a>", 1);//match only first occurrence
                linksToAdd--;
                break;
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}

public class StringLengthComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        return y.Length.CompareTo(x.Length);
    }
}
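
The XPath expression used above covers both conditions from the question: `text()` selects only text nodes, so attribute values such as `alt="keyword"` are never touched, and `not(ancestor::a)` skips text that is already inside an anchor. Here is a small sketch of that filter. Note the swap: it uses System.Xml.XmlDocument from the standard library instead of HtmlAgilityPack, so it only works on well-formed markup, and the class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical demo class; XmlDocument stands in for HtmlAgilityPack here.
public static class LinkableTextSketch
{
    // Returns the text of every text node that is not inside an <a> element.
    public static List<string> GetLinkableText(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup);

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//text()[not(ancestor::a)]"))
        {
            result.Add(node.Value);
        }
        return result;
    }

    public static void Main()
    {
        string markup =
            "<p><img alt=\"keyword\" />keyword here<a href=\"#\">keyword</a></p>";

        // Only the plain text is listed; the alt attribute and the
        // existing link text are both skipped.
        foreach (string text in GetLinkableText(markup))
        {
            Console.WriteLine(text);
        }
    }
}
```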

Hope it will help someone in the future.

YanivHer
  • This code won't work properly because there is a critical defect in the FindAndReplaceKeywordWithAnchor method. node.InnerHtml.Contains(keyword) will return true if your tag is in the text even as part of some word. So if you need "son" but there is a "Jason" it will return true. And then your break will cancel the loop so "son" won't be found. You need to change node.InnerHtml.Contains(keyword) to regex.IsMatch(node.InnerHtml) where regex is new Regex(pattern). – Zoltan Kochan May 06 '13 at 11:04
  • You are right. I have fixed it. Thanks! I won't edit my answer so future people will see your contribution to it. – YanivHer May 09 '13 at 12:17
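
The difference between the two checks discussed in these comments can be seen in a few lines. This is a sketch only: it assumes a plain whole-word pattern without the answer's `\w*` prefix (with that prefix, "Jason" would match as a whole word), and the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original answer.
public static class KeywordCheckSketch
{
    // Whole-word test: the keyword must be delimited by word boundaries.
    public static bool ContainsWholeWord(string text, string keyword)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(keyword) + @"\b");
    }

    public static void Main()
    {
        string text = "Jason went home.";
        string keyword = "son";

        // Substring test: true, because "Jason" contains "son".
        Console.WriteLine(text.Contains(keyword));             // True

        // Whole-word test: false, because "son" is not a separate word here.
        Console.WriteLine(ContainsWholeWord(text, keyword));   // False
    }
}
```

Running the same regex for both the check and the replacement keeps the two in agreement, so the link counter is only decremented when a replacement actually happens.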