I'm looking for a way to check a string containing HTML and decide whether it contains any text that should be visible, not counting whitespace.

Basically anything should count as visible if it shows up as visible text upon rendering it as the innerHTML of a <div>.

For example

  • <div>hello</div> is visible, as "hello" is shown in the browser.
  • <div><p> <br/></div>&nbsp; is not visible.
  • <script>alert('asdf')</script> is not visible.
  • plain text is visible, although it does not contain any html tags.

There are a lot of cases where I'm not sure (any result would be acceptable):

  • <div style="display: none">this is tricky</div> is not visible, but since CSS adds another layer of complexity to the question, it might be a good idea not to bother with it.
  • <script>document.write('What is this, I don't even-')</script> should be outside the scope of this question.
  • <input value="Read this"> is visible, but I don't care about form elements at the moment, so this might as well be not visible.

I want to decide this server-side and deal with the situation accordingly.

Is there a good way to decide this in C#? Writing my own solution seems tedious, and I was wondering if someone already did this (or something similar).

EDIT:

Is the question this incomprehensible? I already stated that I want to do it on the server, not in a browser environment. jQuery and jsfiddle have little relevance here.

vinczemarton
  • You're not using the right tool for this job; that's why the problem seems "hard". If you're doing a postback and need to check the length of some server control, then use the C# .Length property on whatever element you have. The EASIEST way to do it is with jQuery's .html().length and then post back whatever value comes out of this instead of posting back the entire control. – frenchie Dec 05 '13 at 01:01
  • I don't really know what to say to this. Why would I do any postbacks? I'm not sending ANYTHING client side yet. These strings are from a database containing descriptions of some stuff in HTML. There is some garbage data, where the description is supposed to be empty, but `string.IsNullOrEmpty()` returns false as it contains html tags. – vinczemarton Dec 05 '13 at 01:17
  • are you open to libraries like [Html Agility Pack](http://htmlagilitypack.codeplex.com/wikipage?title=Examples) ? – lastr2d2 Dec 05 '13 at 01:20
  • @lastr2d2 : Yes of course. I'm currently looking at http://stackoverflow.com/questions/13248789/detecting-string-containing-only-html-and-no-text which is a similar question. – vinczemarton Dec 05 '13 at 01:21
  • Refer to questions like this one: http://stackoverflow.com/questions/6344771/htmlagilitypack-iterate-all-text-nodes-only and the related questions on the right. I believe it will surely do the magic. – lastr2d2 Dec 05 '13 at 01:26
  • Yes this will be suitable, thanks :) – vinczemarton Dec 05 '13 at 01:28
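
The Html Agility Pack approach suggested in these comments could look roughly like the sketch below. It is not code from this thread: the class name `HtmlVisibility` and the method name `HasVisibleText` are made up for illustration, and it assumes the `HtmlAgilityPack` NuGet package is referenced. It walks the text nodes, skips those inside `<script>` and `<style>`, and ignores CSS and form elements, as the question allows.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlVisibility
{
    // Elements whose text content is never rendered as page text.
    private static readonly string[] HiddenContainers = { "script", "style" };

    // Hypothetical helper: returns true when at least one text node outside
    // <script>/<style> contains a non-whitespace character.
    public static bool HasVisibleText(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
            return false;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            // Keep text nodes only; elements and comments are skipped.
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Drop text that lives inside a script or style element.
            .Where(n => !n.Ancestors().Any(a => HiddenContainers.Contains(a.Name)))
            // DeEntitize turns &nbsp; into U+00A0, which .NET treats as whitespace.
            .Any(n => !string.IsNullOrWhiteSpace(HtmlEntity.DeEntitize(n.InnerText)));
    }
}
```

With the examples from the question, `HtmlVisibility.HasVisibleText("<div>hello</div>")` should come out true and `HtmlVisibility.HasVisibleText("<div><p> <br/></div>&nbsp;")` false. Because it relies on a real parser, it should cope with malformed markup better than a regex, at the cost of speed.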

1 Answer

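    using System.Net;
    using System.Text.RegularExpressions;

    public static bool StripHTMLAndCheckVisible(string HTMLText)
    {
        if (string.IsNullOrEmpty(HTMLText))
            return false;

        // Remove <script> elements together with their content.
        Regex regJs = new Regex(@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)", RegexOptions.IgnoreCase);
        HTMLText = regJs.Replace(HTMLText, "");

        // Remove all remaining tags, keeping the text between them.
        Regex reg = new Regex("<[^>]+>");
        HTMLText = reg.Replace(HTMLText, "");

        // Decode entities such as &nbsp; so they count as whitespace,
        // then report whether any non-whitespace text is left.
        return !string.IsNullOrWhiteSpace(WebUtility.HtmlDecode(HTMLText));
    }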

It removes all HTML tags and <script> blocks, then returns true if any visible text remains and false if not. Hope this helps.


EDIT:

What I ended up doing is:

public static bool CheckHTMLForText(string html)
{
    var stripped = StringHelpers.StripTagsWithContent(html, "script", "style");
    stripped = StringHelpers.StripTagsRegex(stripped);
    // True when non-whitespace text remains after stripping the tags.
    return !string.IsNullOrWhiteSpace(stripped);
}

Where StringHelpers.StripTagsWithContent() strips a given tag along with its content through the end of its closing tag (like the one for the script tag in the example above), and StringHelpers.StripTagsRegex() removes the remaining tags from a string.
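
The `StringHelpers` methods are not shown in this post, so the following is only a guess at how such helpers might be written with the standard library. The class name `StringHelpersSketch`, the wrapper `HasText`, and both regex patterns are assumptions, not the original implementation. Decoding entities via `WebUtility.HtmlDecode` is added so that `&nbsp;` counts as whitespace.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StringHelpersSketch
{
    // Removes each named element together with everything up to and
    // including its closing tag (hypothetical stand-in for StripTagsWithContent).
    public static string StripTagsWithContent(string html, params string[] tagNames)
    {
        foreach (var name in tagNames)
        {
            var tag = Regex.Escape(name);
            var pattern = $@"<{tag}\b[^>]*>.*?</{tag}\s*>";
            html = Regex.Replace(html, pattern, "",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }
        return html;
    }

    // Removes any remaining tags and leaves the text between them
    // (hypothetical stand-in for StripTagsRegex).
    public static string StripTagsRegex(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }

    // Returns true when non-whitespace text is left after stripping.
    public static bool HasText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return false;

        var stripped = StripTagsWithContent(html, "script", "style");
        stripped = StripTagsRegex(stripped);

        // HtmlDecode maps &nbsp; to U+00A0, which IsNullOrWhiteSpace treats as whitespace.
        return !string.IsNullOrWhiteSpace(WebUtility.HtmlDecode(stripped));
    }
}
```

A regex-based stripper like this is fast but can be fooled by unusual markup, such as a `>` inside an attribute value or an unclosed `<script>` tag, so a parser-based check is the safer option when the input is untrusted.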

vinczemarton
Quan Truong
  • Although I like the idea, and I already have a method doing something similar, this might not be sufficient, as there are other tags like `` I need to prepare for and who knows what else. Using the html agility pack seems to be more robust at the moment, but this is way faster. I need to experiment with this. If this suits my needs you might have earned an accept :) – vinczemarton Dec 05 '13 at 08:30