31

In C#, what's the best way to remove blank lines (i.e., lines that contain only whitespace) from a string? I'm happy to use a Regex if that's the best solution.

EDIT: I should add I'm using .NET 2.0.


Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.

First, any Perl 5-compatible regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.

Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines that consist of nothing but whitespace, as well as the last newline. If a string, after running through your regex, ends with "\r\n" or any whitespace characters, it fails.

orad
FunLovinCoder

19 Answers

22

If you want to remove lines that contain only whitespace (tabs, spaces) or nothing at all, try:

string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:

string fix =
    Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
         .TrimEnd();
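
A minimal sketch of how the two steps behave on the bounty's test string. The `BlankLines` class and its method names are hypothetical, made up for this illustration; only `Regex.Replace` and `TrimEnd` come from the framework.

```csharp
using System.Text.RegularExpressions;

public static class BlankLines
{
    // Hypothetical wrapper around the pattern above (not part of any library).
    public static string Remove(string original)
    {
        // With RegexOptions.Multiline, ^ matches at the start of every line and
        // $ matches just before each '\n', so the '\r' of a blank "\r\n" line
        // is consumed by \s* and the '\n' by the literal \n.
        return Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
    }

    // Same as Remove, then strips every trailing whitespace character.
    public static string RemoveAndTrimEnd(string original)
    {
        return Remove(original).TrimEnd();
    }
}
```

For "test\r\n \r\nthis\r\n\r\n", `Remove` should leave "test\r\nthis\r\n" and `RemoveAndTrimEnd` should leave "test\r\nthis". Keep in mind that `TrimEnd` also removes spaces and tabs at the end of the last line, not just the final newline.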
Chris Schmich
  • 1
    `\s+` instead of `\s*` would be better I think – Salman A May 19 '10 at 13:45
  • @Salman Chris' rx is correct, as is my lonely, unappreciated answer. ;-( – Sky Sanders May 19 '10 at 13:50
  • @Salman A: `\s+` would not work on totally empty lines, e.g. `"foo\n\nbar"`. – Chris Schmich May 19 '10 at 13:52
  • 1
    This works, but it can leave an extraneous newline at the end. –  Dec 08 '11 at 19:39
  • @Will: see my updated answer, or are you looking for a purely regex-based solution? – Chris Schmich Dec 08 '11 at 23:07
  • @ChrisSchmich, why even bother with `$`? Wouldn't `^\s*\n` do the same thing? `Regex.Replace(original, @"(?m)^\s*\n", "")` – Qtax Dec 09 '11 at 09:10
  • 2
    @ChrisSchmich: Yes, purely regex. When you have several 100 MB strings in memory, you don't want to create new instances that differ by only "\r\n". If I can get it in one pass, I can rest a little easier on the memory pressure. –  Dec 09 '11 at 11:19
  • @Will: the best I've come up with is `Regex.Replace(original, @"((?<=^|\n)\s*\n)|(\s*$)", string.Empty);`. I haven't thoroughly tested it, nor do I know what the memory usage will be like, but it might work for you. This will chop *all* trailing white space (spaces, tabs, newlines, etc.). – Chris Schmich Dec 10 '11 at 01:06
  • In final testing. Unfortunately, your latest regex has a problem. I have [uploaded a simple app](https://skydrive.live.com/redir.aspx?cid=55025c1963e09246&resid=55025C1963E09246!222&parid=root) that I'm using to verify and run performance tests on, if you want to try again. Specifically, it fails on `"\r\n\test\r\ntest\r\n"`, returning `"test\ntest"`. –  Dec 14 '11 at 16:13
  • @Will: in your test app, it looks like the `RegexOptions.Multiline` option was added to my regex. I tested your example with my regex above (no multiline option, as intended), and the result was as expected: `"test\r\ntest"`. Can you try the test again without that option? – Chris Schmich Dec 15 '11 at 04:49
  • @ChrisSchmich: Sure; I went off the text of the question... Yours and Agmad's are equivalent in correctness. They both fail the same way, stripping *all* whitespace from the end of the last line, rather than just the last newline. Yours is a little better on memory (roughly .003% better, woot) but his runs roughly twice as fast. Appreciate your help on this. –  Dec 15 '11 at 14:58
  • @Yuki: Please provide a better justification than "not good at all" before downvoting. Also, please re-read the question. It's about removing blank lines from an arbitrary string, not just from a serialized JSON object. Your answer does not address the actual problem posted. – Chris Schmich Oct 01 '14 at 21:35
18
string outputString;
using (StringReader reader = new StringReader(originalString))
using (StringWriter writer = new StringWriter())
{
    string line;
    while((line = reader.ReadLine()) != null)
    {
        if (line.Trim().Length > 0)
            writer.WriteLine(line);
    }
    outputString = writer.ToString();
}
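
One thing to watch with `WriteLine` is that it appends a newline after every kept line, including the last one, which the bounty requirements disallow. Below is a sketch of a variation that writes the separator between lines instead. `BlankLineFilter`, `RemoveBlankLines`, and the `newline` parameter are names made up for illustration, and the code tries to stick to APIs that exist in .NET 2.0.

```csharp
using System.IO;
using System.Text;

public static class BlankLineFilter
{
    // Hypothetical helper: drops empty and whitespace-only lines and joins the
    // remaining lines with the given newline, adding nothing after the last one.
    public static string RemoveBlankLines(string input, string newline)
    {
        StringBuilder builder = new StringBuilder(input.Length);
        using (StringReader reader = new StringReader(input))
        {
            string line;
            bool first = true;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length == 0)
                    continue;               // skip empty or whitespace-only lines
                if (!first)
                    builder.Append(newline); // separator goes between kept lines
                builder.Append(line);        // the kept line itself is not trimmed
                first = false;
            }
        }
        return builder.ToString();
    }
}
```

Because `ReadLine` accepts "\r\n", "\n", and "\r" as line terminators, the output line endings are normalized to whatever `newline` is passed in, e.g. `BlankLineFilter.RemoveBlankLines(text, "\r\n")`.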
Thomas Levesque
  • +1 This one is nice since it should scale well for large strings. – Fredrik Mörk May 19 '10 at 13:34
  • 2
    Shouldn't this really be `if (line.Trim().Length > 0) writer.WriteLine(line)`? The OP did not request that all lines be trimmed in the output string. – Dan Tao May 19 '10 at 13:44
14

off the top of my head...

string cleaned = Regex.Replace(input, @"\s*(\n)", "$1");

turns this:

fdasdf
asdf
[tabs]

[spaces]  

asdf


into this:

fdasdf
asdf
asdf
Sky Sanders
8

Using LINQ:

var result = string.Join("\r\n",
                 multilineString.Split(new string[] { "\r\n" }, ...None)
                                .Where(s => !string.IsNullOrWhiteSpace(s)));

If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.

dtb
  • there's no IsNullOrWhitespace method ;) – Thomas Levesque May 19 '10 at 13:32
  • @Thomas Levesque: orly? http://msdn.microsoft.com/en-us/library/system.string.isnullorwhitespace.aspx – dtb May 19 '10 at 13:33
  • my mistake... it's new in .NET 4.0, and I only have the local help for 3.5 – Thomas Levesque May 19 '10 at 13:34
  • This doesn't produce a single string as a result (it produces an enumeration of non-empty lines). I'm not sure that really answers the question completely. – Michael Petito May 19 '10 at 13:37
  • @Michael Petito: note the `string.Join` in the first line which concatenates the enumeration of non-empty lines back together. – dtb May 19 '10 at 13:38
  • @Michael: `string.Join` produces a single string. – Adam Robinson May 19 '10 at 13:38
  • 2
    Ah indeed it is hidden up there. In that case you need a .ToArray() unless you're using .NET 4.0. In my opinion this is far less readable than a regex and I'm not sure what you'd really gain in this approach. – Michael Petito May 19 '10 at 13:40
  • BTW, the OP is using .NET 2.0, so no LINQ... (unless he's using VS2008 + LinqBridge) – Thomas Levesque May 19 '10 at 13:46
  • @Thomas Levesque: That's why I upvoted your answer :-) The requirement was added after I posted my answer. – dtb May 19 '10 at 13:48
  • 4
    When did LINQ become the new regex? – Dinah May 19 '10 at 13:57
  • 5
    I recently used Linq to defrost my freezer. Why do something the old way when Linq is so cool? – Ash Jul 14 '10 at 06:54
  • 1
    why no Environment.NewLine and why bother with the linq when RemoveEmptyEntries does the same thing? –  Dec 08 '11 at 18:27
  • 1
    @Will: Environment.NewLine changes its value depending on the platform, which might be undesirable if the input string contains \r\n line breaks. RemoveEmptyEntries removes only empty entries, but not those that consist of one or more whitespace characters. – dtb Dec 08 '11 at 18:53
4

All right, this answer follows the clarified requirements specified in the bounty:

I also need to remove any trailing newlines, and my Regex-fu is failing. My bounty goes to anyone who can give me a regex which passes this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") == "test\r\nthis"

So here's the answer:

(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z

Or in the C# code provided by @Chris Schmich:

string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", @"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);

Now let's try to understand it. There are three alternative patterns in here, each of which I replace with string.Empty.

  1. (?<=\r?\n)(\s*$\r?\n)+ - matches one or more lines containing only white space that are preceded by a line break (but does not match that preceding line break).
  2. (?<=\r?\n)(\r?\n)+ - matches one or more empty lines with no content that are preceded by a line break (but does not match that preceding line break).
  3. (\r?\n)+\z - matches one or more line breaks at the end of the tested string (trailing line breaks, as you called them).

That satisfies your test perfectly! It also handles both \r\n and \n line break styles. Test it out! I believe this will be the most correct answer: although a simpler expression would pass your specified bounty test, this regex passes more complex conditions.

EDIT: @Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:

\b\s+\z

The \b is a word boundary (the beginning or END of a word), the \s+ is one or more whitespace characters, and the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file, including tabs and spaces in addition to carriage returns and line breaks. I tested both of @Will's provided test cases.

So all together now, it should be:

(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z

EDIT #2: All right, there is one more possible case @Will found that the last regex doesn't cover: inputs that have line breaks at the beginning of the file before any content. So let's add one more pattern to match the beginning of the file.

\A\s+ - The \A matches the beginning of the file, and the \s+ matches one or more whitespace characters.

So now we've got:

\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z

So now we have four patterns for matching:

  1. whitespace at the beginning of the file,
  2. redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
  3. redundant line breaks with no content, (ex: \r\n\r\n)
  4. whitespace at the end of the file
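
As a rough sketch, the final pattern can be wrapped in a compiled `Regex` and checked against the test strings from the comments. The class name `WhitespaceStripper` and the field name are hypothetical; `StripWhitespace` is borrowed from the bounty's test description.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceStripper
{
    // Final four-alternative pattern from this answer. RegexOptions.Multiline
    // lets $ match before each '\n'; RegexOptions.Compiled is optional and
    // only matters for repeated use.
    private static readonly Regex BlankLinePattern = new Regex(
        @"\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Replaces every match of the pattern with the empty string.
    public static string StripWhitespace(string input)
    {
        return BlankLinePattern.Replace(input, string.Empty);
    }
}
```

By my reading, `WhitespaceStripper.StripWhitespace("test\r\n \r\nthis\r\n\r\n")` yields "test\r\nthis". Two caveats follow from the pattern itself: `\A\s+` also removes indentation in front of the first non-blank line, and `\b\s+\z` needs a word character immediately before the trailing whitespace, so a last line ending in punctuation may keep its newline.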
BenSwayne
  • @Will this should satisfy your requirements with a single Regex.Replace. – BenSwayne Dec 09 '11 at 05:36
  • Ouch, that looks like a lot of work, but it also fails when there are mixed newlines and whitespace at the end of the string. For example, this string `"one\r\n \r\ntwo\r\n\t\r\n \r\n"` will be `"one\r\ntwo\r\n"` after the replace. –  Dec 09 '11 at 12:28
  • @Will I'll do an edit to address this *'bug'*. The Regex is a lot of work and not tested/benchmarked as the fastest way to remove lines from a string, but it was what you asked for. A one-liner regex. – BenSwayne Dec 09 '11 at 16:21
  • Hmm, with the current edit `"test\r\n \r\nthis\r\n\r\n"` leaves an empty line between "test" and "this". –  Dec 09 '11 at 16:50
  • @Will This works for me testing in C#/.Net2. What environment are you running in? There are some subtle differences in regex between .Net and Perl, etc... I may be able to tweak it. – BenSwayne Dec 10 '11 at 00:11
  • I think I'm going to have to bounty this a couple times to thank everyone who took so much time to help me out on this. –  Dec 10 '11 at 17:57
  • In final testing. Unfortunately, your latest regex has a problem. I have [uploaded a simple app](https://skydrive.live.com/redir.aspx?cid=55025c1963e09246&resid=55025C1963E09246!222&parid=root) that I'm using to verify and run performance tests on, if you want to try again. Specifically, it fails on `"\r\ntest2"`, returning `""\r\ntest2""`. –  Dec 14 '11 at 16:16
  • @Will you are very skilled with scope creep my friend! :-) This is another use case not in the original bounty question. Please see edit #2 above for solution. I suppose its good to have a thorough answer for the community reference anyhow. If you are performance testing remember to use compiled regex! – BenSwayne Dec 14 '11 at 21:17
  • My bad. I should have published the solution prior to adding the bounty comment. I was, in this case, too specific. I should have left it as "remove all whitespace lines *and* the last newline". Your edit works pretty good, but it strips off *all* whitespace at the end of the last line, and not *just* the newline. –  Dec 14 '11 at 21:24
3

Not good. I would do this instead, using JSON.NET:

var o = JsonConvert.DeserializeObject(prettyJson);
string minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
Yuki
2

In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.

Use RegexOptions.Multiline with this pattern:

^\s+(?!\B)|\s*(?>[\r\n]+)$

Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.

string[] inputs = 
{
    "one\r\n \r\ntwo\r\n\t\r\n \r\n",
    "test\r\n \r\nthis\r\n\r\n",
    "\r\n\r\ntest!",
    "\r\ntest\r\n ! test",
    "\r\ntest \r\n ! "
};
string[] outputs = 
{
    "one\r\ntwo",
    "test\r\nthis",
    "test!",
    "test\r\n ! test",
    "test \r\n ! "
};

string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";

for (int i = 0; i < inputs.Length; i++)
{
    string result = Regex.Replace(inputs[i], pattern, "",
                                  RegexOptions.Multiline);
    Console.WriteLine(result == outputs[i]);
}

EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.

Ahmad Mageed
  • Nice try, but it isn't perfect. It fails when mixing whitespace and newlines near the end of the string. `"one\r\n \r\ntwo\r\n\t\r\n \r\n"` will still have that newline at the end. –  Dec 09 '11 at 12:25
  • @Will thanks for the feedback. I've updated the pattern and sample code to address the new test case. Give that a try. I also cleaned up the post with regards to the space being eaten up and opted to keep the `(?!\B)` portion in `^\s+(?!\B)` since I think that's closer to the spirit of the request and maintains spaces where a valid character exists. – Ahmad Mageed Dec 09 '11 at 14:16
  • 1
    Aaah, much better. I'll spend some time today (styling and) profiling and running test cases on it. Thanks. –  Dec 09 '11 at 14:23
  • In final testing. Your regex is the best so far, but the only issue I'm having is that if there is *whitespace* on the last line, it removes all of it, not just the last newline. In other words, `"test\s\r\ntest\s\r\n"` is returned `"test\s\r\ntest"`. I have [uploaded a simple app](https://skydrive.live.com/redir.aspx?cid=55025c1963e09246&resid=55025C1963E09246!222&parid=root) that I'm using to verify and run performance tests on, if you want to try again. –  Dec 14 '11 at 16:19
  • @Will I'll try to take a look later tonight. I updated the pattern slightly to shorten it, but it doesn't do anything new to address your last comment. – Ahmad Mageed Dec 14 '11 at 18:08
  • @Will I downloaded the sample but couldn't come up with a pattern to clean up that last scenario. I spent some time trying my hand at a conditional pattern that would try to account for that final `\r\n` and preserve spaces but it didn't work out. – Ahmad Mageed Dec 15 '11 at 05:32
  • Well, you and Chris Schmich have essentially the same issue, and that doesn't appear to be at all easy to fix. I'll place the bounty now and work on tweaking it later. So, having to choose between the two of you, I profiled both of your expressions. His wins on memory by an extremely slight margin. Your expression, however, runs twice as fast. So I'm awarding you the bounty. Appreciate all your help on this. –  Dec 15 '11 at 15:01
  • @Will my pleasure, and thanks for sharing the general profiling findings. – Ahmad Mageed Dec 15 '11 at 22:28
1

If it's only white space, why don't you use the C# string.Replace method?

    string yourstring = "A O P V 1.5";
    yourstring = yourstring.Replace(" ", string.Empty);

The result will be "AOPV1.5".

dnxit
1
string corrected = 
    System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");
Adam Robinson
1

Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.

public static string RemoveEmptyLines(this string text) {
    var builder = new StringBuilder();

    using (var reader = new StringReader(text)) {
        while (reader.Peek() != -1) {
            string line = reader.ReadLine();
            if (!string.IsNullOrWhiteSpace(line))
                builder.AppendLine(line);
        }
    }

    return builder.ToString();
}

Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:

public static bool IsNullOrWhiteSpace(string text) {
    return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
Dan Tao
  • @Adam: Ha, wow, very stupid statement I made there. I meant no intermediate *arrays*, as the `string.Split` method would (thanks). – Dan Tao May 19 '10 at 13:40
1

I'll go with:

  public static string RemoveEmptyLines(string value) {
    using (StringReader reader = new StringReader(value)) {
      StringBuilder builder = new StringBuilder();
      string line;
      while ((line = reader.ReadLine()) != null) {
        if (line.Trim().Length > 0)
          builder.AppendLine(line);
      }
      return builder.ToString();
    }
  }
Julien Lebosquain
1

In response to Will's bounty, here is a Perl sub that gives the correct result for the test case:

sub StripWhitespace {
    my $str = shift;
    print "'",$str,"'\n";
    $str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
    print "'",$str,"'\n";
    return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");

output:

'test

this

'
'test
this'

To avoid \R, replace it with [\r\n] and swap the order of the alternatives. This one produces the same result:

$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;

No special configuration or multiline support is needed. Nevertheless, you can add the s flag if it's mandatory.

$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
Toto
  • Er, I can use Perl-compat regexes... but I'm not familiar with Perl. Can you just clarify what the regex is? I think I sussed it out, but I want to be sure. Thanks. (edit) uh, yeah, for example I just learned about the s/ operator. Also, if there are any configuration options required (multiline etc) (edit edit) *Also* It has to be PCRE 5; 7 won't cut it. \R is too new an addition. –  Dec 09 '11 at 14:48
  • Hmmm, I can't seem to get it to work. It *does* remove empty lines, and any trailing newlines, but it also crops the last non-whitespace char of every line. Might still be a problem with conversion. Any chance you can give me the regex without *any perl syntax whatsoever?* –  Dec 09 '11 at 16:48
0

String Extension

public static string UnPrettyJson(this string s)
{
    try
    {
        // var jsonObj = Json.Decode(s);
        // var sObject = Json.Encode(value);   doesn't work well with an array of strings c:['a','b','c']

        object jsonObj = JsonConvert.DeserializeObject(s);
        return JsonConvert.SerializeObject(jsonObj, Formatting.None);
    }
    catch (Exception e)
    {
        throw new Exception(
            s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
    }
}
Bernhard Barker
Math
0
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines);
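
Note that `RemoveEmptyEntries` only drops zero-length entries, so a line holding just spaces or tabs survives the split. Below is a sketch that adds an explicit whitespace check, written without LINQ so it should also compile for .NET 2.0. `SplitFilter.RemoveBlankLines` is a hypothetical name, and joining with "\r\n" rather than `Environment.NewLine` is an assumption made here to keep the output independent of the platform.

```csharp
using System;
using System.Collections.Generic;

public static class SplitFilter
{
    // Hypothetical helper: splits on CR and LF, keeps only the lines that
    // contain at least one non-whitespace character, and rejoins with "\r\n".
    public static string RemoveBlankLines(string value)
    {
        char[] delimiters = new char[] { '\r', '\n' };
        string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);

        List<string> kept = new List<string>();
        foreach (string line in lines)
        {
            if (line.Trim().Length > 0)   // drop whitespace-only entries
                kept.Add(line);
        }

        // string.Join places the separator only between entries,
        // so no trailing newline is produced.
        return string.Join("\r\n", kept.ToArray());
    }
}
```

The trade-off is the intermediate array and list, which may matter for very large inputs.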
Ben Hoffstein
0

I'm not sure whether it's efficient, but =)

  List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
  myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).ToList());
albatross
0

Here is something simple if working against each individual line...

(^\s+|\s+|^)$
kgoedtel
0

Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips

  1. All empty lines from the start of a string
    • Not including any spaces at the beginning of the first non-whitespace line
  2. All empty lines after the first non-whitespace line and before the last non-whitespace line
    • Again, preserving all whitespace at the beginning of any non-whitespace line
  3. All empty lines after the last non-whitespace line, including the last newline

(?<=(\r\n)|^)\s*\r\n|\r\n\s*$

which essentially says:

  • Immediately after
    • The beginning of the string OR
    • The end of the last line
  • Match as much contiguous whitespace as possible that ends in a newline*
  • OR
  • Match a newline and as much contiguous whitespace as possible that ends at the end of the string

The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.

Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.

*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
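
A small sketch for trying the expression above, under the same assumptions as the footnote ("\r\n" newlines, no regex options). The class and member names are hypothetical and exist only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class CrLfBlankLines
{
    // Pattern quoted from the answer. No RegexOptions.Multiline is set, so ^
    // refers to the start of the whole string and $ to its end.
    private static readonly Regex Pattern =
        new Regex(@"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$");

    // Removes whitespace-only lines and the final newline.
    public static string Strip(string input)
    {
        return Pattern.Replace(input, string.Empty);
    }
}
```

As far as I can tell, `CrLfBlankLines.Strip("test \r\ntest \r\n")` returns "test \r\ntest ", i.e. the trailing space on the last line is kept and only the final newline is dropped, which is the case the other regex answers had trouble with.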

-1

Try this.

string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);

string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
Sky Sanders
dretzlaff17
  • What if I am reading a file imported from a Unix system? Then my Windows Environment.NewLine won't match the newlines from the file. – felickz Dec 08 '11 at 21:09
-2
s = Regex.Replace(s, @"^[^\n\S]*\n", "", RegexOptions.Multiline);

[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:

s = Regex.Replace(s, @"^[ \t\r]*\n", "", RegexOptions.Multiline);

And if you want it to catch the last line, without a final linefeed:

s = Regex.Replace(s, @"^[ \t\r]*(?:\n|\z)", "", RegexOptions.Multiline);
Alan Moore