0

I have a string that may contain some HTML tags. I'd like to remove some of them (along with their content), but not all of them.

Specifically, I'd like to remove <img /> and <div>...</div>.

So, for example, if I have the string `hello <div>bye bye</div> marco`, I'd like to get `hello marco`.

How can I do this in C#?

markzzz
  • Be aware that Regular Expressions won't be able to correctly handle divs inside other divs, so they're not ideal for this scenario – William Lawn Stewart Jun 16 '11 at 08:56
  • I know. Any other ideas? – markzzz Jun 16 '11 at 09:12
  • @William: He is using C#. The .NET regex implementation can handle nested divs (see http://blogs.msdn.com/b/bclteam/archive/2005/03/15/396452.aspx). But you really really do not want to do this. =) – Jens Jun 16 '11 at 09:17
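
For reference, the .NET feature Jens refers to is balancing groups. The following is only a rough sketch of how they could match an outermost <div> together with any divs nested inside it, not a recommendation to do so. The names `NestedDivStripper` and `Strip` are made up for this illustration, and the pattern assumes well-formed markup with no `>` characters inside attribute values.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NestedDivStripper
{
    // Matches an outermost <div>...</div>, including nested divs,
    // by counting opening and closing tags with a balancing group.
    private static readonly Regex BalancedDiv = new Regex(
        @"<div\b[^>]*>                    # opening tag of the outer div
          (?>
              (?:(?!<div\b|</div>).)+     # text that does not start a div tag
            | (?<open><div\b[^>]*>)       # nested opening tag: push onto 'open'
            | (?<-open></div>)            # nested closing tag: pop from 'open'
          )*
          (?(open)(?!))                   # fail if a nested div is still open
          </div>                          # closing tag of the outer div",
        RegexOptions.IgnorePatternWhitespace
            | RegexOptions.Singleline
            | RegexOptions.IgnoreCase);

    // Returns the input with every outermost div (and its content) removed.
    public static string Strip(string html)
    {
        return BalancedDiv.Replace(html, "");
    }
}
```

Calling `NestedDivStripper.Strip("hello <div>a<div>b</div>c</div> marco")` removes the whole outer div in one match. Unbalanced or otherwise malformed markup is exactly where this approach gets fragile, which is the reason a parser is usually preferred.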

2 Answers

7

I think you are aware of people's general opinion about parsing HTML with regex. I would recommend using an HTML parser such as HTML Agility Pack.

Here's a sample:

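using System;
using System.IO;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("hello <div>bye bye</div> marco <img src=\"http://example.com\"/> test");

        // Tags to remove, together with their content (top-level nodes only).
        var tagsToRemove = new[] { "div", "img" };

        // Iterate backwards so that removing a node does not shift the nodes still to be visited.
        for (int i = doc.DocumentNode.ChildNodes.Count - 1; i >= 0; i--)
        {
            var child = doc.DocumentNode.ChildNodes[i];
            if (child.NodeType == HtmlNodeType.Element && tagsToRemove.Contains(child.Name, StringComparer.OrdinalIgnoreCase))
            {
                doc.DocumentNode.RemoveChild(child);
            }
        }

        var sb = new StringBuilder();
        using (var writer = new StringWriter(sb))
        {
            doc.Save(writer);
        }
        Console.WriteLine(sb); // prints "hello  marco  test"
    }
}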
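
The loop above only looks at top-level nodes. If the tags can appear at any depth, a variant based on an XPath query might look like the sketch below. It assumes the HtmlAgilityPack package is referenced; `TagStripper` and `RemoveDivAndImg` are names invented for this example.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class TagStripper
{
    // Removes every <div> and <img> element (and its content) at any depth.
    public static string RemoveDivAndImg(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath expression.
        var nodes = doc.DocumentNode.SelectNodes("//div | //img");
        if (nodes != null)
        {
            // Copy to a list first so removal does not disturb the enumeration.
            foreach (var node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

For example, `TagStripper.RemoveDivAndImg("hello <div>bye <div>inner</div> bye</div> marco")` drops the outer div along with everything nested inside it.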
Darin Dimitrov
  • Haha "parsing HTML with regex" is the best post EVER. :) – Filip Ekberg Jun 16 '11 at 08:52
  • @markzzz: You're going to need to actually parse it *sometime*, and once you've learned your way around such a library, it's generally actually *easier* than writing a regex since you can use tools that talk on the level you're thinking (e.g. XPath for elements) rather than mind-bending token-twisting. Learn it once, and be done with it. – Eamon Nerbonne Jun 16 '11 at 09:01
2

It is not a good idea to use regex for XML. Depending on the language, you should use an XML library instead.

In this case the regex is pretty simple, though:

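        // requires: using System.Text.RegularExpressions;
        string s = "hello <div>bye bye</div> marco <img />";

        // Matches a <div> with no attributes and no nested tags, or an <img /> without attributes.
        Regex rgx = new Regex("(<div>[^<]*</div>)|(<img */>)");
        s = rgx.Replace(s, ""); // s is now "hello  marco "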
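
The tweaks suggested in the comments on this answer (a lazy quantifier and allowing attributes) could be combined roughly as in the sketch below. `SimpleTagStripper` and `Strip` are hypothetical names, and the pattern still assumes there are no divs nested inside other divs.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class SimpleTagStripper
{
    // <div ...> up to the nearest </div> (lazy), or an <img ...> tag with any attributes.
    // Singleline lets '.' match newlines; IgnoreCase also accepts <DIV> and <IMG>.
    private static readonly Regex DivOrImg = new Regex(
        @"<div\b[^>]*>.*?</div>|<img\b[^>]*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with matching div blocks and img tags removed.
    public static string Strip(string html)
    {
        return DivOrImg.Replace(html, "");
    }
}
```

Because the match stops at the first `</div>`, a div nested inside another div leaves the outer closing tag and any trailing content behind, so this only suits input known to be flat.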
Petar Ivanov
  • Please do not encourage this.. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Filip Ekberg Jun 16 '11 at 08:53
  • @fiver, No, it solves his problem. Even though solving the problem like this is a bad idea. – Filip Ekberg Jun 16 '11 at 08:57
  • Won't work if there's divs inside other divs, which is why Regexes aren't a good idea for this sort of thing – William Lawn Stewart Jun 16 '11 at 08:58
  • True! Well, in case of nested divs it won't break the XML - it will remove the innermost div only. But I totally agree. Regex + XML = Disaster – Petar Ivanov Jun 16 '11 at 09:03
  • @Filip Ekberg : I don't know any way to solve this problem, do you? P.S. Why do you talk about XML? I'm doing it on HTML :) – markzzz Jun 16 '11 at 09:10
  • I think you should use `.*?` instead of `[^<]*` to include constructs like "...", and change `<img */>` to `<img [^>]*>` to account for attributes. – Jens Jun 16 '11 at 09:14
  • Tried your solution, but with this string `Hello
    my name is Eric
    Marco`, for example, nothing changes :(
    – markzzz Jun 16 '11 at 09:24
  • The proposed solution will eat everything from the first div through the last /div. Might want to make the `*` lazy like this: `*?` – agent-j Jun 16 '11 at 09:29
  • @fiver : please try replacing this `string s = "Hello
    my name is Eric
    Marco";` : it won't work. Also if I change regex from `(
    [^<]` to `(
    – markzzz Jun 16 '11 at 09:45