-5

Is it possible to remove all whitespace from the following HTML string in C#?

"
<html>

<body>

</body>

</html>
"

Thanks

Guffa
Funky
  • 10
    Here: `` – Oded Aug 24 '12 at 10:47
  • 1
    To the downvoters, provide an answer along with your downvote. – OscarRyz Aug 24 '12 at 10:48
  • 6
    @OscarRyz - There is no obligation to do so. However, the question shows no research or effort. I am assuming this is why it is getting downvoted. The question is also not a realistic representation of an actual programming issue - it doesn't have _any_ context. – Oded Aug 24 '12 at 10:49
  • 2
    When dealing with HTML, or any markup for that matter, rather than just hacking what you think is just a string, it's best to run it through a parser that understands it. You can use HtmlAgilityPack to parse it and write it back out properly: http://htmlagilitypack.codeplex.com/ ... or use HTML Tidy: http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat ... http://tidy.sourceforge.net/ ... or something like it. – Colin Smith Aug 24 '12 at 10:50
  • 1
    Do you want to remove white-spaces or empty lines? – Tim Schmelter Aug 24 '12 at 10:58

4 Answers

5

When dealing with HTML or any markup for that matter, it's usually best to run it through a parser that truly understands the rules of that markup.

The first benefit is that it can tell you if your initial input data is garbage to start with.

If the parser is smart enough, it might even be able to correct badly formed markup automatically, or accept it under relaxed rules.

You can then modify the parsed content and have the parser write out the changes. This way you can be sure the markup rules are followed and the output is correct.

For simple HTML markup, or for markup so badly formed that a parser balks at it straight away, you can fall back to editing the input string directly with string replacements and the like. Which approach you take depends on your needs.

Here are a couple of tools that can help you out:

HTML Tidy

You can use HTML Tidy and just specify some options/rules on how you want your HTML to be tidied up (e.g. remove superfluous whitespace).

It's a Win32 DLL, but there are C# wrappers for it.

HtmlAgilityPack

You can use HtmlAgilityPack to parse HTML if you need to understand the structure better and perhaps do your own tidying up/restructuring.
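
As a rough illustration of the parser route, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (`HtmlWhitespaceSketch`, `RemoveWhitespaceOnlyTextNodes`) are made up for this example. It only drops text nodes that consist entirely of whitespace; it does not collapse whitespace inside real text, and it would also strip blank text nodes inside elements such as `<pre>`, where whitespace can matter.

```csharp
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class HtmlWhitespaceSketch
{
    // Hypothetical helper: removes text nodes that contain nothing but whitespace.
    public static string RemoveWhitespaceOnlyTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() is enumerated lazily, so take a snapshot before removing nodes.
        var blankTextNodes = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && string.IsNullOrWhiteSpace(n.InnerText))
            .ToList();

        foreach (var node in blankTextNodes)
        {
            node.Remove();
        }

        // Serialize the modified tree back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `HtmlWhitespaceSketch.RemoveWhitespaceOnlyTextNodes(myString)` on the string from the question should leave only the tags, because every text node in that input is blank.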

wp78de
Colin Smith
3
myString = myString.Replace(System.Environment.NewLine, "");
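
One caveat with the line above: `Environment.NewLine` is `"\r\n"` on Windows and `"\n"` on Unix-like systems, so the call only removes line breaks that match the current platform's sequence. A minimal sketch that removes both characters regardless of platform could look like this (the class and method names are hypothetical). Spaces and tabs are left untouched, just as in the one-liner.

```csharp
public static class LineBreakSketch
{
    // Hypothetical helper: removes carriage returns and line feeds,
    // which covers "\r\n", "\n" and lone "\r" line endings.
    public static string RemoveLineBreaks(string input)
    {
        return input
            .Replace("\r", "")
            .Replace("\n", "");
    }
}
```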
Venkata Krishna
1

You can use a regular expression to match white space characters for the replace:

s = Regex.Replace(s, @"\s+", String.Empty); // requires: using System.Text.RegularExpressions;
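
Note that `\s+` also removes the spaces inside tags and inside text, so `<p class="a">Hello world</p>` becomes `<pclass="a">Helloworld</p>`. If the string may contain more than bare tags, a narrower variant is to strip whitespace only where it sits between a `>` and the next `<`. The following is a minimal sketch under that assumption, with hypothetical names; it is still plain string manipulation, not HTML parsing, and it would also remove whitespace between inline elements or inside `<pre>`, where it can affect rendering.

```csharp
using System.Text.RegularExpressions;

public static class TagWhitespaceSketch
{
    // Hypothetical helper: removes whitespace runs between '>' and '<',
    // then trims leading and trailing whitespace from the whole string.
    public static string RemoveWhitespaceBetweenTags(string html)
    {
        return Regex.Replace(html, @">\s+<", "><").Trim();
    }
}
```

With this variant, `"<p class=\"a\">Hello world</p>\n<p>Bye</p>\n"` becomes `<p class="a">Hello world</p><p>Bye</p>`: the attribute separator and the space in the text are kept.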
Guffa
  • @Eric: Yes, it works. It removes white space in the string. Space characters are also white space. Did you downvote my answer because you invented another requirement for the question? – Guffa Aug 25 '12 at 17:57
  • I guess I made the assumption that the result should also be valid HTML. – Eric Aug 25 '12 at 21:21
  • @Eric: And the assumption that the string would contain something completely different from the example in the question... – Guffa Aug 25 '12 at 21:40
  • You made that assumption too. If you assume the HTML is as shown, then Oded's comment is the only sensible answer. – Eric Aug 26 '12 at 11:59
-1

I used this solution. In my opinion it works well; see also the test code.

  1. Add an extension method to trim the HTML string:
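// Requires "using System.Text;" and, as an extension method, must be declared inside a static class.
public static string RemoveSuperfluousWhitespaces(this string input)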
{
    if (input.Length < 3) return input;
    var resultString = new StringBuilder(); // Using StringBuilder is much faster than using regular expressions here!
    var inputChars = input.ToCharArray();
    var index1 = 0;
    var index2 = 1;
    var index3 = 2;
    // Remove superfluous white spaces from the html stream by the following replacements:
    //  '<no whitespace>' '>' '<whitespace>' ==> '<no whitespace>' '>'
    //  '<whitespace>' '<' '<no whitespace>' ==> '<' '<no whitespace>'
    while (index3 < inputChars.Length)
    {
        var char1 = inputChars[index1];
        var char2 = inputChars[index2];
        var char3 = inputChars[index3];
        if (!Char.IsWhiteSpace(char1) && char2 == '>' && Char.IsWhiteSpace(char3))
        {
            // drop whitespace character in char3
            index3++;
        }
        else if (Char.IsWhiteSpace(char1) && char2 == '<' && !Char.IsWhiteSpace(char3))
        {
            // drop whitespace character in char1
            index1 = index2;
            index2 = index3;
            index3++;
        }
        else
        {
            resultString.Append(char1);
            index1 = index2;
            index2 = index3;
            index3++;
        }
    }

    // The loop ended because index3 ran past the end; the two characters still held in the window have not been written yet.
    resultString.Append(inputChars[index1]);
    resultString.Append(inputChars[index2]);
    var str = resultString.ToString();
    return str;
}

  2. Add test code:

[Test]
public void TestRemoveSuperfluousWhitespaces()
{
    var html1 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
    var html2 = $"<td class=\"keycolumn\">{Environment.NewLine}<p class=\"mandatory\">Some recipe parameter name</p>{Environment.NewLine}</td>";
    var html3 = $"<td class=\"keycolumn\">{Environment.NewLine} <p class=\"mandatory\">Some recipe parameter name</p> {Environment.NewLine}</td>";
    var html4 = " <td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
    var html5 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td> ";
    // Be() compares strings exactly; BeEquivalentTo() on strings would ignore casing.
    var compactedHtml1 = html1.RemoveSuperfluousWhitespaces();
    compactedHtml1.Should().Be(html1);
    var compactedHtml2 = html2.RemoveSuperfluousWhitespaces();
    compactedHtml2.Should().Be(html1);
    var compactedHtml3 = html3.RemoveSuperfluousWhitespaces();
    compactedHtml3.Should().Be(html1);
    var compactedHtml4 = html4.RemoveSuperfluousWhitespaces();
    compactedHtml4.Should().Be(html1);
    var compactedHtml5 = html5.RemoveSuperfluousWhitespaces();
    compactedHtml5.Should().Be(html1);
}