I am converting HTML to plain text, but the result contains many extra line breaks and spaces. How can I remove them?
4 Answers
string new_string = Regex.Replace(orig_string, @"\s", "");
will remove all whitespace.
string new_string = Regex.Replace(orig_string, @"\s+", " ");
will collapse each run of whitespace into a single space.
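As a minimal sketch of the difference between the two calls above, the snippet below wraps each in a helper and runs both on one sample string. The helper names and the sample text are made up for illustration, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper names, chosen for this example only.
static string RemoveAllWhitespace(string s) => Regex.Replace(s, @"\s", "");
static string CollapseWhitespace(string s) => Regex.Replace(s, @"\s+", " ");

// Sample input with mixed spaces, Windows line endings and a tab.
string sample = "Hello   \r\n\r\n  world\t!";

Console.WriteLine(RemoveAllWhitespace(sample)); // Helloworld!
Console.WriteLine(CollapseWhitespace(sample));  // Hello world !
```

Note that the second form also turns line breaks into spaces, so it is only suitable when line structure does not need to survive.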

-
`\s` is a shorthand for space, newline, tab and form feed (and some other whitespace in some implementations), so it will remove those returns and convert them into a single space. – Tim Pietzcker Feb 11 '11 at 20:56
I'm assuming that you want to
- find two or more consecutive spaces and replace them with a single space, and
- find two or more consecutive newlines and replace them with a single newline.
If that's correct, then you could use
resultString = Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");
This keeps the original "type" of whitespace intact and also preserves Windows line endings correctly. If you also want to "condense" multiple tabs into one, use
resultString = Regex.Replace(subjectString, @"( |\t|\r?\n)\1+", "$1");
To condense a string of newlines and spaces (any number of each) into a single newline, use
resultString = Regex.Replace(subjectString, @"(?:(?:\r?\n)+ +){2,}", "\n");
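The patterns above can be tried out with a short sketch like the following. The helper names and test strings are made up for illustration, and the code assumes C# 9 or later (top-level statements). The replacement in the second helper is the regular string "\n", because .NET replacement patterns do not process character escapes, so a verbatim @"\n" would insert a backslash followed by the letter n.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: collapse repeats of the SAME whitespace token
// (a space or a line break) into one occurrence of that token.
static string CollapseSameType(string s) =>
    Regex.Replace(s, @"( |\r?\n)\1+", "$1");

// Hypothetical helper: collapse two or more "newlines followed by spaces"
// groups into a single line feed.
static string CollapseNewlineSpaceRuns(string s) =>
    Regex.Replace(s, @"(?:(?:\r?\n)+ +){2,}", "\n");

// Repeated spaces become one space; repeated CRLFs become one CRLF.
Console.WriteLine(CollapseSameType("a   b\r\n\r\n\r\nc") == "a b\r\nc"); // True

// Alternating newline/space has no adjacent duplicates, so nothing changes.
Console.WriteLine(CollapseSameType("a\n \n b") == "a\n \n b"); // True

// The mixed case is handled by the second pattern.
Console.WriteLine(CollapseNewlineSpaceRuns("x\n \n \n y") == "x\ny"); // True
```

The second check shows why the mixed newline-and-space case raised in the comments needs its own pass.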

-
+1 for maintaining new lines and only collapsing duplicates of the same type – John McDonald Feb 11 '11 at 21:44
-
Good, but there is one more situation this cannot solve: returns mixed with spaces, like `\n \n \n \n \n \n \n \n \n \n`. – Shisoft Feb 12 '11 at 09:50
-
And what do you want the result to be in that case? What if you have something like `\n\n \n\n \n\n` or `\n \n \n \n` or `\n \n\n \n \n\n \n \n\n` etc.? – Tim Pietzcker Feb 12 '11 at 10:22
-
@Tim Pietzcker I want consecutive spaces to become " ", consecutive returns to become "\n", and returns mixed with spaces, like `\n\n \n\n \n\n \n`, to become "\n". P.S.: there can be more than one space between the "\n"s. – Shisoft Feb 12 '11 at 10:33
-
I think I can do it by replacing the string twice. First, just like in the answer. Next, replace `\n \n \n \n \n \n \n \n \n \n`. – Shisoft Feb 12 '11 at 10:45
-
I have added another regex for this case; this would have to be applied before or after the other regex. – Tim Pietzcker Feb 12 '11 at 13:03
-
(.NET) If you want to keep the carriage returns, which are included in `\s`, use `[ \t]` instead of `\s`. To remove multiple carriage returns: `txt = Regex.Replace(txt, @"( |\r?\n)\1+", "$1");`. To remove duplicate blank spaces or multiple tabs: `txt = Regex.Replace(txt, @"[ \t]+", " ");`. To remove blank lines or lines consisting of spaces and tabs: `txt = Regex.Replace(txt, @"^[ \t]+$[\r\n]*", "", RegexOptions.Multiline).Trim();` – Allen Jan 29 '15 at 15:52
-
@TimPietzcker What would I need to change in this regex, `Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");`, if I want to remove the `\r\n` as well? – Ammar Khan Mar 27 '17 at 09:10
-
@AmmarKhan: Would you also want to remove spaces? If yes, just replace the match with an empty string. If not, it's complicated, and an answer wouldn't fit inside a comment. – Tim Pietzcker Mar 27 '17 at 20:46
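I have tried a lot of algorithms for this. The loop-based ones all worked, but this one was the clearest and most direct.
// Define the characters you want to remove
char tb = (char)9;    // tab (ASCII code 9)
char spc = (char)32;  // space (ASCII code 32)
char nwln = (char)10; // line feed (ASCII code 10)
// Strings are immutable, so assign the result of each Replace back
yourstring = yourstring.Replace(tb.ToString(), "");
yourstring = yourstring.Replace(spc.ToString(), "");
yourstring = yourstring.Replace(nwln.ToString(), "");
// Defining the chars explicitly gave a better result.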

You can use Trim() to remove the spaces and returns. In HTML, whitespace is not significant, so you can strip it with the Trim() method of the System.String class.

-
In fact, only leading and trailing characters are supported: http://msdn.microsoft.com/en-us/library/system.string.trim.aspx. +1 for suggesting an alternative though, maybe try to expand idea on this for the OP and give a regexless solution? – Grant Thomas Feb 11 '11 at 20:20
-
You can remove whitespace and also any other characters you want. If you want to remove returns, I think the best way is to use this: `"Your Html".Trim('\n')` – Mohammad M. Ramezanpour Feb 11 '11 at 20:23
-
The point is, it only removes them from the **beginning** and **end** of the string. The OP is trying to collapse whitespace throughout the string. `Trim` may be useful, but it won't do the whole job. – Alan Moore Feb 12 '11 at 04:11
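To illustrate the point made in these comments, here is a small sketch showing what `Trim()` leaves behind, followed by one possible regex-free way to collapse interior whitespace using `Split` and `string.Join`. The helper name and sample text are made up for illustration, the split-and-join approach is only one option (it also turns line breaks into spaces), and the code assumes C# 9 or later (top-level statements).

```csharp
using System;

// Hypothetical helper: split on whitespace characters, drop the empty
// pieces, and rejoin the remaining words with single spaces.
static string CollapseWithoutRegex(string s) =>
    string.Join(" ", s.Split(new[] { ' ', '\t', '\r', '\n' },
                             StringSplitOptions.RemoveEmptyEntries));

string text = "  \n first   line \n\n second \n ";

// Trim() only strips the leading and trailing whitespace;
// the runs inside the string are untouched.
Console.WriteLine(text.Trim() == "first   line \n\n second"); // True

// The split-and-join sketch collapses every run, including line breaks.
Console.WriteLine(CollapseWithoutRegex(text) == "first line second"); // True
```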