13

I converted some HTML to plain text, but the result contains many extra line breaks and spaces. How can I remove them?

Shisoft
  • it sounds obvious, but if replacing spaces and CRLFs doesn't beautify your HTML enough, you may consider using an [HTML formatter](http://stackoverflow.com/a/15120971/382515) – Ivan Ferrer Villa Sep 30 '16 at 15:36

4 Answers

18

`string new_string = Regex.Replace(orig_string, @"\s", "");` removes all whitespace.

`string new_string = Regex.Replace(orig_string, @"\s+", " ");` collapses each run of whitespace (including line breaks) into a single space.
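
To make the difference between the two patterns concrete, here is a minimal sketch. The class name `WhitespaceDemo`, the method names, and the sample string are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Removes every whitespace character (spaces, tabs, CR, LF, ...).
    public static string RemoveAll(string s) => Regex.Replace(s, @"\s", "");

    // Replaces each run of whitespace with a single space.
    // Note that line breaks are turned into spaces as well.
    public static string Collapse(string s) => Regex.Replace(s, @"\s+", " ");

    public static void Main()
    {
        string sample = "Hello   \r\n\r\n  world\t!"; // hypothetical input
        Console.WriteLine(RemoveAll(sample)); // Helloworld!
        Console.WriteLine(Collapse(sample));  // Hello world !
    }
}
```

If line breaks need to survive, neither pattern is enough on its own, because `\s` also matches `\r` and `\n`.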

Daniel DiPaolo
16

I'm assuming that you want to

  • find two or more consecutive spaces and replace them with a single space, and
  • find two or more consecutive newlines and replace them with a single newline.

If that's correct, then you could use

resultString = Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");

This keeps the original "type" of whitespace intact and also preserves Windows line endings correctly. If you also want to "condense" multiple tabs into one, use

resultString = Regex.Replace(subjectString, @"( |\t|\r?\n)\1+", "$1");

To condense a string of newlines and spaces (any number of each) into a single newline, use

resultString = Regex.Replace(subjectString, @"(?:(?:\r?\n)+ +){2,}", "\n");
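
As a rough illustration of how the backreference pattern behaves, the sketch below runs the space/tab/newline variant on a few made-up inputs. The class name `RepeatDemo` and the helper `CollapseRepeats` are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatDemo
{
    // Collapses a run of the SAME whitespace token (space, tab, or line break)
    // into one occurrence of that token. "\1+" only matches repeats of
    // whatever group 1 captured, so the kind of whitespace is preserved.
    public static string CollapseRepeats(string s) =>
        Regex.Replace(s, @"( |\t|\r?\n)\1+", "$1");

    public static void Main()
    {
        // Three spaces, three LFs, and two CRLFs each shrink to one.
        Console.WriteLine(CollapseRepeats("a   b\n\n\nc\r\n\r\nd") == "a b\nc\r\nd"); // True

        // Two tabs shrink to one tab.
        Console.WriteLine(CollapseRepeats("x\t\ty") == "x\ty"); // True

        // Alternating newline/space is left alone: no token repeats itself.
        Console.WriteLine(CollapseRepeats("x\n \n \ny") == "x\n \n \ny"); // True
    }
}
```

The last case shows why mixed runs of newlines and spaces need a separate pattern.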
Tim Pietzcker
  • +1 for maintaining new lines and only collapsing duplicates of the same type – John McDonald Feb 11 '11 at 21:44
  • Good, but there is one more situation it cannot solve: returns mixed with spaces, like `\n \n \n \n \n \n \n \n \n \n` – Shisoft Feb 12 '11 at 09:50
  • And what do you want the result to be in that case? What if you have something like `\n\n \n\n \n\n` or `\n  \n  \n  \n` or `\n \n\n \n \n\n \n \n\n` etc.? – Tim Pietzcker Feb 12 '11 at 10:22
  • @Tim Pietzcker I want consecutive spaces to become " ", consecutive returns to become "\n", and returns mixed with spaces, like `\n\n \n\n \n\n \n`, to become "\n". P.S.: there can be more than one space between the "\n"s – Shisoft Feb 12 '11 at 10:33
  • I think I can do it by replacing the string twice: first just like the answer, then replacing `\n \n \n \n \n \n \n \n \n \n`. – Shisoft Feb 12 '11 at 10:45
  • I have added another regex for this case; this would have to be applied before or after the other regex. – Tim Pietzcker Feb 12 '11 at 13:03
  • (.NET) If you want to keep the carriage returns which are included with \s, use [ \t] instead of \s.
    // remove multiple carriage returns
    txt = Regex.Replace(txt, @"( |\r?\n)\1+", "$1");
    // remove duplicate blank spaces or multiple tabs
    txt = Regex.Replace(txt, @"[ \t]+", " ");
    // remove blank lines or lines consisting of spaces and tabs
    txt = Regex.Replace(txt, @"^[ \t]+$[\r\n]*", "", RegexOptions.Multiline).Trim();
    – Allen Jan 29 '15 at 15:52
  • @TimPietzcker What would I need to change in this regex, `Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");`, if I want to remove the `\r\n` as well? – Ammar Khan Mar 27 '17 at 09:10
  • @AmmarKhan: Would you also want to remove spaces? If yes, just replace the match with an empty string. If not, it's complicated, and an answer wouldn't fit inside a comment. – Tim Pietzcker Mar 27 '17 at 20:46
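
The behaviour requested in the comments (runs of spaces become one space, and any run of returns mixed with spaces becomes one "\n") can be sketched as two passes. This is only one possible approach and not the regex from the answer; the class name `MixedRunDemo`, the method `Normalize`, and both patterns are assumptions made for this example. Note that it also turns `\r\n` into `\n` and treats tabs like spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MixedRunDemo
{
    public static string Normalize(string s)
    {
        // Pass 1: a run that contains at least one line break, together with
        // any spaces/tabs around or between the breaks, becomes a single "\n".
        s = Regex.Replace(s, @"[ \t]*(?:\r?\n[ \t]*)+", "\n");

        // Pass 2: whatever runs of spaces/tabs remain (inside a line)
        // become a single space.
        return Regex.Replace(s, @"[ \t]+", " ");
    }

    public static void Main()
    {
        string sample = "a  b\n \n \n   c"; // hypothetical input
        Console.WriteLine(Normalize(sample) == "a b\nc"); // True
    }
}
```

Doing the line-break pass first means the second pass never sees spaces that sit next to a line break, so the two replacements do not interfere with each other.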
0

I tried many approaches for this. Every loop-based version worked, but this one was the clearest and most direct.

// Define the characters you want to remove
char tb = (char)9;    // tab (ASCII 9)
char spc = (char)32;  // space (ASCII 32)
char nwln = (char)10; // line feed (ASCII 10)

// Strings are immutable, so assign the result of each Replace call
yourstring = yourstring.Replace(tb.ToString(), "");
yourstring = yourstring.Replace(spc.ToString(), "");
yourstring = yourstring.Replace(nwln.ToString(), "");

// Defining the chars explicitly gave a better result.
ithnegique
-2

You can use Trim() to remove the spaces and returns. In HTML, whitespace is not significant, so you can strip it with the Trim() method of the System.String class.

  • 1
    I think Trim can only remove whitespace at the start and end of the string – Shisoft Feb 11 '11 at 20:14
  • In fact, only leading and trailing characters are supported: http://msdn.microsoft.com/en-us/library/system.string.trim.aspx. +1 for suggesting an alternative though, maybe try to expand idea on this for the OP and give a regexless solution? – Grant Thomas Feb 11 '11 at 20:20
  • You can remove white space and also any other chars you want to remove. If you want to remove returns, I think the best way is to use this: "Your Html".Trim('\n') – Mohammad M. Ramezanpour Feb 11 '11 at 20:23
  • 2
    The point is, it only removes them from the **beginning** and **end** of the string. The OP is trying to collapse whitespace throughout the string. `Trim` may be useful, but it won't do the whole job. – Alan Moore Feb 12 '11 at 04:11
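
To illustrate the point made in these comments, and the regex-free alternative that was asked for, here is a small sketch. The class name `TrimDemo` and the helper `CollapseToSpaces` are hypothetical; the helper relies on `String.Split` treating a null separator array as "split on whitespace", and it replaces line breaks with spaces too.

```csharp
using System;

public static class TrimDemo
{
    // Regex-free collapse: split on any whitespace, drop the empty pieces,
    // and join the remaining words with single spaces.
    public static string CollapseToSpaces(string s) =>
        string.Join(" ", s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

    public static void Main()
    {
        string sample = "  a \n\n b  "; // hypothetical input

        // Trim only strips the leading and trailing whitespace.
        Console.WriteLine(sample.Trim() == "a \n\n b"); // True

        // Split + Join collapses the whitespace inside the string as well.
        Console.WriteLine(CollapseToSpaces(sample) == "a b"); // True
    }
}
```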