I am trying to do a token replacement in an HTML string. My untokenised string has multiple <input></input> elements, and I want to replace each one with a token built from its name attribute, for example <<VS_USER_NAME>>. But my regex replaces all of the <input> elements at once instead of only the one with the matching name. Below is a standalone example.

This is the desired output:

<div>username&nbsp;<<VS_USER_NAME>></div><div>&nbsp;</div><div>full name&nbsp;<<VS_USER_FULL_NAME>></div><div>&nbsp;</div><div>password&nbsp;<<VS_USER_PASSWORD>></div><div>&nbsp;</div><div>thanks</div>

Code:

static void Main(string[] args)
{
    string text = "<div>username&nbsp;<input class=\"VSField\" contenteditable=\"false\" name=\"VS_USER_NAME\" style=\"background-color: rgb(220,220,200);\">[User Name]</input></div><div>&nbsp;</div><div>full name&nbsp;<input class=\"VSField\" contenteditable=\"false\" name=\"VS_USER_FULL_NAME\" style=\"background-color: rgb(220,220,200);\">[Full Name]</input></div><div>&nbsp;</div><div>password&nbsp;<input class=\"VSField\" contenteditable=\"false\" name=\"VS_USER_PASSWORD\" style=\"background-color: rgb(220,220,200);\">[Password]</input></div><div>&nbsp;</div><div>thanks</div>";
    string textTokenised = GetTokenisedText(text, "VS_USER_NAME", "VS_USER_FULL_NAME", "VS_USER_PASSWORD");
}

private static string GetTokenisedText(string untokenised, params string[] tokenKeys)
{
    foreach (string tokenKey in tokenKeys)
    {
        string string2 = GetToken(tokenKey);
        string string1 = GetRegex(tokenKey);

        untokenised = Regex.Replace(untokenised, string1, string2);
    }

    return untokenised;
}

private static string GetToken(string tokenKey)
{
    return string.Format("<<{0}>>", tokenKey);
}

private static string GetRegex(string tokenKey)
{
    return string.Format("()<input([^>]*e*)name=\"{0}\"([^>]*e*)>(.*)</input>", tokenKey);
}
Wiktor Stribiżew
wk4questions

2 Answers

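The `.*` in your regex is greedy by default, so it runs on to the last </input> in the string and the first replacement swallows every <input> after it. Make it non-greedy by adding `?` after it. Use the following: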

return string.Format("()<input([^>]*e*)name=\"{0}\"([^>]*e*)>(.*?)</input>", tokenKey); 
                                                                ↑
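
To see the difference in isolation, here is a minimal sketch that runs both quantifiers against a shortened two-element string. The class name, the `Html` constant and the key `A` are made up for the illustration, and the pattern is a trimmed version of the one above rather than the exact original.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical input: two <input> elements, shortened from the question's markup.
    public const string Html =
        "<div><input name=\"A\">[a]</input></div><div><input name=\"B\">[b]</input></div>";

    // Greedy: (.*) extends to the LAST </input> in the string.
    public static string ReplaceGreedy(string html)
    {
        return Regex.Replace(html, "<input[^>]*name=\"A\"[^>]*>(.*)</input>", "<<A>>");
    }

    // Lazy: (.*?) stops at the FIRST </input> after the opening tag.
    public static string ReplaceLazy(string html)
    {
        return Regex.Replace(html, "<input[^>]*name=\"A\"[^>]*>(.*?)</input>", "<<A>>");
    }

    public static void Main()
    {
        // Prints: <div><<A>></div>
        Console.WriteLine(ReplaceGreedy(Html));
        // Prints: <div><<A>></div><div><input name="B">[b]</input></div>
        Console.WriteLine(ReplaceLazy(Html));
    }
}
```

The greedy version removes the second element together with the first, which is the behaviour described in the question; the lazy version leaves it in place for the next token key.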
karthik manchala

Here is an example of how you can do the same with HtmlAgilityPack:

// Requires "using System.Linq;" and a reference to the HtmlAgilityPack package.
private static string GetTokenisedText(string untokenised, params string[] tokenKeys)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(untokenised);
    var query = doc.DocumentNode.Descendants("input");
    // ToList() takes a snapshot so nodes can be replaced while iterating.
    foreach (var item in query.ToList())
    {
        var value = item.GetAttributeValue("name", string.Empty);
        if (!string.IsNullOrEmpty(value))
        {
            var token = tokenKeys.FirstOrDefault(p => p == value);
            if (!string.IsNullOrEmpty(token))
            {
                // Remove the node that follows <input> (here the placeholder text,
                // e.g. "[User Name]"), guarding against a missing sibling.
                if (item.NextSibling != null)
                {
                    item.NextSibling.Remove();
                }
                var newNode = HtmlAgilityPack.HtmlTextNode.CreateNode(string.Format("{{{{{0}}}}}", token.ToUpper()));
                item.ParentNode.ReplaceChild(newNode, item);
            }
        }
    }
    return doc.DocumentNode.OuterHtml;
}

Output:

<div>username&nbsp;{{VS_USER_NAME}}</div><div>&nbsp;</div><div>full name&nbsp;{{VS_USER_FULL_NAME}}</div><div>&nbsp;</div><div>password&nbsp;{{VS_USER_PASSWORD}}</div><div>&nbsp;</div><div>thanks</div>

{{ and }} are preferable markers to << and >> in an (X)HTML document, where < and > are markup characters.

You can install HtmlAgilityPack using the Manage NuGet Packages for Solution menu item when right-clicking your solution.

Wiktor Stribiżew
  • I like your approach as well with the HtmlAgilityPack – wk4questions Jun 01 '15 at 08:41
  • :) Feel free to upvote then. Also mind that if you have huge HTML pages and no match, you might run into catastrophic backtracking with a `.*?` regex. I have seen that several times, which is why I'd recommend using alternatives. – Wiktor Stribiżew Jun 01 '15 at 08:44
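
If you do stay with a regex, one way to act on the backtracking concern above is to drop `.*?` in favour of a negated character class and to give the regex a match timeout. The following is only a sketch under assumptions: the class and method names are hypothetical, it assumes the text between <input> and </input> never contains a `<`, and it assumes attribute values are double-quoted and contain no `>`. It also does all keys in a single pass instead of one `Regex.Replace` per key.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglePassTokeniser
{
    // [^<]* cannot run past the next tag, so no .*? is needed.
    // The timeout makes a pathological input throw RegexMatchTimeoutException
    // instead of hanging.
    private static readonly Regex InputElement = new Regex(
        "<input\\b[^>]*\\bname=\"(?<name>[^\"]*)\"[^>]*>[^<]*</input>",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    public static string Tokenise(string html, params string[] tokenKeys)
    {
        var keys = new HashSet<string>(tokenKeys);
        // Replace an element only when its name is one of the requested keys;
        // otherwise return the matched text unchanged.
        return InputElement.Replace(html, m =>
        {
            string name = m.Groups["name"].Value;
            return keys.Contains(name) ? "<<" + name + ">>" : m.Value;
        });
    }
}
```

Calling `SinglePassTokeniser.Tokenise(text, "VS_USER_NAME", "VS_USER_FULL_NAME", "VS_USER_PASSWORD")` on the string from the question should give the desired output shown there. Note that `\bname=` would also match an attribute such as `data-name`, so an HTML parser remains the more robust choice for markup you do not control.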