An algorithm using LINQ or C# to sanitize specific HTML from a string

Question

Background info: I have a large body of text that I regularly load into a single string from an XML document (using LINQ). This string contains a lot of HTML that I need to preserve for output, but the email addresses and discrete HTML links that occasionally occur in it need to be removed. An example of the offending text looks like this:

--<a href="mailto:jsmith@email.com" target="_blank">John Smith</a> from <a href="http://www.agenericwebsite.com" target="_blank">Romanesque Architecture</a></p>

What I need to be able to do is:

Find the following string: <a href
Delete that string and every character after it, up to and including the next >
Also, always delete this string: </a>

Is there a way with LINQ that I can do this easily or am I going to have to create an algorithm using .NET string manipulation to achieve this?

Why do you want to use LINQ? This looks like regex/string manipulation would be much simpler — Tom Squires, Nov 14 '11 at 17:40
+1 @AustinSalonen The only answer for any question regarding processing html! Html and regex is an accident waiting to happen. And I like regex :) — Goran, Nov 14 '11 at 17:59

score 2 · Accepted Answer · edited May 23 '17 at 11:55

You could probably do this with LINQ, but it sounds like a plain old regex would be much, much better.

It sounds like this question, and particularly this answer, demonstrate what you're trying to do.
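As a rough illustration of the regex approach (a minimal sketch, not taken from the linked answer), the three steps in the question can be collapsed into one pattern. The class and method names `AnchorStripper` and `StripAnchorTags` are made up for this example. The sketch assumes no attribute value inside an `<a>` tag contains a literal `>`; a real HTML parser is the safer choice for markup you do not control.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorStripper
{
    // Matches an opening <a ...> tag with any attributes, or a closing </a> tag.
    // \b stops the pattern from matching other tags that start with "a" (e.g. <abbr>).
    // [^>]* consumes characters only up to the first '>' after "<a", so a '>'
    // elsewhere in the text is never touched.
    private static readonly Regex AnchorTag =
        new Regex(@"<a\b[^>]*>|</a\s*>", RegexOptions.IgnoreCase);

    // Removes the anchor tags but keeps the text between them.
    public static string StripAnchorTags(string html)
    {
        return AnchorTag.Replace(html, string.Empty);
    }

    static void Main()
    {
        string input =
            "--<a href=\"mailto:jsmith@email.com\" target=\"_blank\">John Smith</a> from " +
            "<a href=\"http://www.agenericwebsite.com\" target=\"_blank\">Romanesque Architecture</a></p>";

        Console.WriteLine(StripAnchorTags(input));
        // prints: --John Smith from Romanesque Architecture</p>
    }
}
```

Because the character class `[^>]*` cannot cross a `>`, each match ends at the closing bracket of the tag it started in, which is what keeps the other `>` characters in the text safe.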

answered Nov 14 '11 at 17:40 · Adam Rackis

Ah, Regex. I was afraid so. Unfortunately, I haven't ever used it, but now is a good time to learn. Now, I understand that Regex helps identify substrings and patterns within a string, but if I apply the techniques in the link you provided, how am I going to get around that the ending delimiter for most of my emails and HTML links is >, which appears frequently in other places in my text? Thanks for the help by the way. – Isaiah Nelson Nov 14 '11 at 17:51
@full - not sure I understand. Can't you use the technique from the answer to search for strings starting with – Adam Rackis Nov 14 '11 at 17:56
I probably can. My response was based on a limited knowledge of the capabilities of Regex. Do you or anyone have a favorite source for reading up on it? – Isaiah Nelson Nov 14 '11 at 17:58
@full - no, actually my regex knowledge is fairly limited. I know this is the perfect situation for a regex, but I'm not sure what the details of implementing it would be. Just use the links above to get you started, make a good attempt, then ask a new question when you get stuck :) – Adam Rackis Nov 14 '11 at 18:06

score 1 · Answer 2 · edited Dec 14 '13 at 17:14

If you want to do this strictly with LINQ to XML, try something like this recursive function:

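    // Requires: using System.Linq; using System.Xml.Linq;
    static void ReplaceNodesWithContent(XElement element, string targetElementname)
    {
        // Replace a matching element with its text content (nested markup is dropped).
        if (element.Name == targetElementname)
        {
            element.ReplaceWith(element.Value);
            return;
        }

        // Snapshot the children with ToList(): replacing a child while lazily
        // enumerating Elements() can end the enumeration early and skip later siblings.
        foreach (var child in element.Elements().ToList())
        {
            ReplaceNodesWithContent(child, targetElementname);
        }
    }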

Usage example:

    static void Main(string[] args)
    {
        string xml = @"<root>
<items>
    <item>
        <a>inner</a>
    </item>
    <item>
        <subitem>
            <a>another one</a>
        </subitem>
    </item>
</items>
</root>";

        XElement x = XElement.Parse(xml);

        ReplaceNodesWithContent(x, "a");

        string res = x.ToString();
        //            res == @"<root>
        //                      <items>
        //                        <item>inner</item>
        //                        <item>
        //                          <subitem>another one</subitem>
        //                        </item>
        //                      </items>
        //                    </root>"
    }
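A non-recursive variant of the same idea, applied to the text from the question, might look like the sketch below. It is only an illustration: `AnchorUnwrapper` and `UnwrapAnchors` are hypothetical names, the fragment is assumed to be well-formed XML (real-world HTML often is not, and `XElement.Parse` will throw on it), and the opening `<p>` is added here so the sample parses. Unlike `element.Value`, replacing with `a.Nodes()` keeps any markup nested inside the link.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class AnchorUnwrapper
{
    // Replaces every <a> element under root with its own child nodes.
    public static void UnwrapAnchors(XElement root)
    {
        // Reverse() puts nested anchors before their ancestors, and ToList()
        // snapshots the matches so the tree is not modified while it is
        // still being enumerated.
        foreach (var a in root.Descendants("a").Reverse().ToList())
        {
            a.ReplaceWith(a.Nodes());
        }
    }

    static void Main()
    {
        var p = XElement.Parse(
            "<p>--<a href=\"mailto:jsmith@email.com\" target=\"_blank\">John Smith</a> from " +
            "<a href=\"http://www.agenericwebsite.com\" target=\"_blank\">Romanesque Architecture</a></p>");

        UnwrapAnchors(p);

        Console.WriteLine(p.ToString(SaveOptions.DisableFormatting));
        // prints: <p>--John Smith from Romanesque Architecture</p>
    }
}
```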

Yeah, I can definitely see where you are going with this. Thanks for the input, but I'll probably be taking this opportunity to learn Regex. – Isaiah Nelson, Nov 14 '11 at 18:06
