186

How do you split a multi-line string into lines?

I know this way

var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

looks a bit ugly and loses empty lines. Is there a better solution?

Steve Chambers
Konstantin Spirin
  • Possible duplicate of [Easiest way to split a string on newlines in .NET?](https://stackoverflow.com/questions/1547476/easiest-way-to-split-a-string-on-newlines-in-net) – Robin Bennett May 13 '19 at 08:40
  • Yes, you use the exact line delimiter present in the file, e.g. *just "\r\n"* or *just "\n"* rather than using *either `\r` or `\n`* and ending up with a load of blank lines on windows-created files. What system uses LFCR line endings, btw? – Caius Jard Feb 02 '22 at 06:45
  • @CaiusJard LFCR is used in RISC OS... It was used in some early microcomputers of the late 70s and early 80s, but it does not seem relevant anymore. – Loudenvier May 30 '22 at 21:30

12 Answers

223
  • If it looks ugly, just remove the unnecessary ToCharArray call.

  • If you want to split by either \n or \r, you've got two options:

    • Use an array literal – but this will give you empty lines for Windows-style line endings \r\n:

      var result = text.Split(new [] { '\r', '\n' });
      
    • Use a regular expression, as indicated by Bart:

      var result = Regex.Split(text, "\r\n|\r|\n");
      
  • If you want to preserve empty lines, why do you explicitly tell C# to throw them away? (StringSplitOptions parameter) – use StringSplitOptions.None instead.
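
To make the difference between the two options concrete, here is a minimal sketch. The class and method names (SplitComparison, SplitOnChars, SplitWithRegex) are made up for illustration; only string.Split and Regex.Split come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitComparison
{
    // Splits on every '\r' and every '\n' individually, keeping empty entries.
    public static string[] SplitOnChars(string text) =>
        text.Split(new[] { '\r', '\n' });

    // Treats "\r\n" as a single terminator, then falls back to a lone '\r' or '\n'.
    public static string[] SplitWithRegex(string text) =>
        Regex.Split(text, "\r\n|\r|\n");

    public static void Main()
    {
        // One Windows-style break, then one deliberately empty line.
        string text = "a\r\nb\n\nc";

        Console.WriteLine(SplitOnChars(text).Length);   // 5: "a", "", "b", "", "c"
        Console.WriteLine(SplitWithRegex(text).Length); // 4: "a", "b", "", "c"
    }
}
```

The character-based split reports a spurious empty entry between '\r' and '\n', while the regex keeps only the empty line that is really in the text.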

Konrad Rudolph
  • 2
    Removing ToCharArray will make code platform-specific (NewLine can be '\n') – Konstantin Spirin Oct 02 '09 at 09:11
  • @Kon you should use Environment.NewLine if that is your concern. Or do you mean the origin of the text, rather than the location of execution? –  Jan 20 '11 at 17:03
  • 1
    @Will: on the off chance that you were referring to me instead of Konstantin: I believe (*strongly*) that parsing code should strive to work on all platforms (i.e. it should also read text files that were encoded on *different* platforms than the executing platform). So for parsing, `Environment.NewLine` is a no-go as far as I’m concerned. In fact, of all the possible solutions I prefer the one using regular expressions since only that handles all source platforms correctly. – Konrad Rudolph Jan 20 '11 at 17:14
  • lol didn't notice the name similarity. I agree completely in this case. –  Jan 20 '11 at 18:32
  • 3
    @Hamish Well just look at the documentation of the enum, or look in the original question! It’s `StringSplitOptions.RemoveEmptyEntries`. – Konrad Rudolph Oct 19 '11 at 16:41
  • Ah I see, my bad, I was looking within RegexOptions; have not had my coffee yet. – Hamish Grubijan Oct 19 '11 at 17:37
  • 9
    How about the text that contains '\r\n\r\n'. string.Split will return 4 empty lines, however with '\r\n' it should give 2. It gets worse if '\r\n' and '\r' are mixed in one file. – username Apr 27 '12 at 18:52
  • 2
    @SurikovPavel Use the regular expression. That is definitely the preferred variant, as it works correctly with any combination of line endings. – Konrad Rudolph Apr 27 '12 at 23:28
  • 1
    A minor point - I usually go with the verbatim string literal in the second argument to `Regex.Split`, i.e. - `var result = Regex.Split(text, @"\r\n|\r|\n");` In this case it works either way because the C# compiler interprets \n and \r in the same way that the regular expression parser does. In the general case though it might cause problems. – Ken Clement Nov 15 '17 at 22:18
  • 1
    Just adding my 2c worth. Since the OP wants to keep blank lines, you *can't* write a parser that works for any type of environment and/or handles mixed cases (i.e. the RegEx), because if you have '\n\r' how do you know it's one 'break' instead of two that are just encoded wrong? If it's the latter, it would be two blank lines, but if it's the former, it would only be one. You have to ask what is the source of the encodings. If the source is on the same platform as the parser (regardless of what platform it is) then you *can* use Environment.NewLine as the source is known. – Mark A. Donohoe Aug 20 '18 at 20:27
  • @MarqueIV There are different possible answers to this, all valid. One is to expect and require *consistent* text files. Another one is to not accept `"\r"` on its own as a line separator (because, let’s face it, no system has used this convention in well over a decade): the only actually used conventions are `"\r\n"` and `"\n"`. In fact, your example (`"\n\r"`) has *never* been a valid line break anywhere. Either read it as *two* line breaks or throw an error, but certainly don’t treat it as a single line break. – Konrad Rudolph Aug 21 '18 at 07:56
  • First things first, my text was a typo. Use '\r\n' and my point is still the same: you can't write a *universal* parser on a system if you're required to keep blank lines. Note that by adding the restriction that you're not accepting '\r' by itself, and you only want to use '\n' to detect new lines, with that change, *you no longer have a universal parser* essentially proving my point that without such limitations, it can't (easily*) be done, and chances are doesn't need to be in the first place. (*It can playing with RegEx ordering and such, but that just makes it much slower.) – Mark A. Donohoe Aug 21 '18 at 08:42
  • @MarqueIV I think you misread my comment: since `"\r"` is never used as a delimiter, you can easily write a universal parser that accepts all actually used delimiters; it’s done by simply splitting on `"\r\n|\n"`. There’s no need for anything more fancy than that. But, honestly, in practice there’s nothing wrong with the regex code shown in my answer, and it will work just fine with a file that mixes different styles of line breaks, including the obsolete `"\r"`. – Konrad Rudolph Aug 21 '18 at 09:23
  • If you have input that has mixed styles like you said, there's no way to differentiate between '\n\r' and '\n' and '\r' without making the assumption that there will never be an '\r', and when you make that assumption, then you've removed the condition that I just mentioned that causes the ambiguity. Plus, you can't make that assumption anyway as there are plenty of embedded hardware systems that use '\r'. That's why terminals give you three choices for line breaks. You need to know your input up front. I guess we'll just have to disagree and each use what works for us. – Mark A. Donohoe Aug 21 '18 at 09:39
  • @MarqueIV That’s why my previous comment says “in practice” it works. You’re arguing from a pretty unlikely case. Yes, obviously such cases are ambiguous but I contend that they are not relevant enough to care, and these ambiguities are fundamentally unresolvable, anyway: *no* parsing strategy will work since the ambiguity is then in the data itself, not in the parsing process. – Konrad Rudolph Aug 21 '18 at 09:45
  • But I believe you just made my point for me. That's exactly why I just use Environment.NewLine by default, and only use something like the RegEx solution if you venture outside the realm of the more-likely scenarios. It happens, but as they say, a giant time-killer is implementing solutions for things that might happen, rather than things that do. Sure, *plan* for the future of course (i.e. don't design yourself into a corner where you can't make the change later), but don't actually implement a future until you actually need to. In other words, I don't think our points are that far off. – Mark A. Donohoe Aug 21 '18 at 14:23
  • @MarqueIV “That's exactly why I just use Environment.NewLine” — but that’s the *worst* thing you can do because now you start breaking lots of actual files, whereas my solution breaks approximately zero actually existing files. Check out how many modern text editors use only the system’s newline for line breaks (hint: none do). – Konrad Rudolph Aug 21 '18 at 14:23
  • Nothing is broken if you're never planning on getting anything that doesn't match your platform's encoding. If you know that (just like you know there may never be a '\r') then you're optimizing your results, not wasting time running things through a RegEx engine that don't need to be, which can kill a time-critical application. If you will have multiple encodings, then use the RegEx. You just can't do universal. Again, I don't think we're arguing the same point. You've made yours and I've made a different one. Tangential, but not in contradiction. – Mark A. Donohoe Aug 21 '18 at 14:25
  • @MarqueIV I honestly have trouble understanding your use-case: You don’t need to go beyond your current platform to encounter text files that use different line ending conventions. I know for a fact that my current system contains files with different conventions (I edited one just yesterday, and I only know about the diverging line endings because `diff` flagged them). This isn’t “planning for the future”, it’s making code robust for the here and now. – Konrad Rudolph Aug 21 '18 at 14:27
  • Plus, taking a step back, one could argue that if you *do* need blank lines but *don't* enforce a standard for line encodings, then you're just asking for trouble anyway. After all, if you skip blank lines, you *can* write a universal parser, rendering this entire convo thread obsolete! :) – Mark A. Donohoe Aug 21 '18 at 14:27
  • And in your case, I'd argue the 'platform' is you using editing tools that may have differing line endings, hence you getting your diff. But if you're using a known format for instance, from another system, and not something manually edited, then there's no need to plan for that case and you can increase throughput of processing by not. Again, *we're not arguing the same point!*. Time and place. If you're taking in user-editable files, then I 100% agree with you. But if you're taking in system-generated files from a known system on the same platform, then I stand by my original statement. :) – Mark A. Donohoe Aug 21 '18 at 14:33
  • @MarqueIV No, nothing was mangled. The files have different (but internally consistent) line endings because they were created by different people, on different platforms. Yet they end up on my machine. — And I want to emphasise that we are *very much* arguing the same point, because I’m fundamentally not understanding where your potential use-case exists. I simply don’t see when it would be more useful, and produce less problems, to split on a platform hard-coded newline rather than using my heuristic, which I (and clearly many others) have found to work in 100% of real files. – Konrad Rudolph Aug 21 '18 at 14:34
  • "Created by different people, on different platforms". That is a different use-case than something say from a web service where the line endings are predictable and consistent. And if that system is on the same platform, then you can use Environment.NewLine and crush the performance of RegEx. Again, time and place. I plan for, but don't implement solutions for things until they happen. Just like the code, developer productivity is also increased. – Mark A. Donohoe Aug 21 '18 at 14:36
  • To hopefully appease you, if you're saying you need a system that has to detect blank lines, and you are taking files created on platforms with differing line endings, and you're guaranteeing you will never get '\r' by itself and/or your line endings will be consistent in the same file (which you can't if it's edited on machines with two different line endings and all line endings aren't updated), then I agree... the regex works. But I'm saying if you *can't* make those guarantees, it won't because you then won't be able to differentiate between '\n\r' and '\n' and '\r'. Make sense? – Mark A. Donohoe Aug 21 '18 at 14:46
  • In fairness, *nothing* will work in that case, not just RegEx because there is no standard for the line endings on the parser, which brings me back to one of my earlier points, if you are saying blank lines are important to you, then you must define what represents a blank line or you can't answer the above question (without those other guarantees that is.) – Mark A. Donohoe Aug 21 '18 at 14:50
  • More precision might help: it is not possible to write a parser to handle a combination of all cases, the RE here will handle combinations of any two cases in one file. – Mic Sep 23 '18 at 10:46
161
using (StringReader sr = new StringReader(text)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // do something
    }
}
Colonel Panic
Jack
  • 14
    This is the cleanest approach, in my subjective opinion. – primo Oct 21 '13 at 09:41
  • 6
    Any idea in terms of performance (compared to `string.Split` or `Regex.Split`)? – Uwe Keim Jan 25 '19 at 07:49
  • I like this solution a lot, but I found a minor problem: when the last line is empty, it's ignored (only the last one). So, `"example"` and `"example\r\n"` will both produce only one line while `"example\r\n\r\n"` will produce two lines. This behavior is discussed here: https://github.com/dotnet/runtime/issues/27715 – Alielson Piffer Jan 28 '22 at 21:04
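
The trailing-line behaviour described in the last comment can be reproduced with a short sketch. ReadLineDemo and ReadAllLines are hypothetical names used only for this illustration; the loop itself is the StringReader pattern from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Collects every line that StringReader.ReadLine returns for the given text.
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }

    public static void Main()
    {
        Console.WriteLine(ReadAllLines("example").Count);         // 1
        Console.WriteLine(ReadAllLines("example\r\n").Count);     // 1 (final terminator adds no line)
        Console.WriteLine(ReadAllLines("example\r\n\r\n").Count); // 2 ("example" and "")
    }
}
```

If a trailing empty line matters for your data, this is the case to test before choosing the StringReader approach over a split-based one.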
79

Update: See my other answer further down (the StringReader-based iterator) for an alternative solution that yields lines lazily.


This works great and is faster than Regex:

input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)

It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

Regex.Split(input, "\r\n|\r|\n")

Regex.Split(input, "\r?\n|\r")

Except that Regex turns out to be roughly 8 times slower (about 32 s versus 3.9 s in the run below). Here's my test:

// Requires: using System.Diagnostics;
Action<Action> measure = (Action func) => {
    var stopwatch = Stopwatch.StartNew();   // better suited to timing than DateTime.Now
    for (int i = 0; i < 100000; i++) {
        func();
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
);

measure(() =>
    Regex.Split(input, "\r\n|\r|\n")
);

measure(() =>
    Regex.Split(input, "\r?\n|\r")
);

Output:

00:00:03.8527616

00:00:31.8017726

00:00:32.5557128

and here's the Extension Method:

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        return str.Split(new[] { "\r\n", "\r", "\n" },
            removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines
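
The claim that the string-array split and both regex patterns give the same results can be checked with a sketch like the following. SplitEquivalence, SplitOnStrings and AllAgree are hypothetical names introduced for this check; the sample string is the one used in the benchmark above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitEquivalence
{
    // The split from the answer: "\r\n" is listed first so it counts as one break.
    public static string[] SplitOnStrings(string input) =>
        input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

    // True when both regex patterns produce exactly the same sequence of lines.
    public static bool AllAgree(string input)
    {
        string[] expected = SplitOnStrings(input);
        return expected.SequenceEqual(Regex.Split(input, "\r\n|\r|\n"))
            && expected.SequenceEqual(Regex.Split(input, "\r?\n|\r"));
    }

    public static void Main()
    {
        string input = "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
        Console.WriteLine(AllAgree(input));
    }
}
```

A check like this is worth keeping next to a benchmark, so that a faster variant is only compared against alternatives that return the same lines.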
Uwe Keim
orad
  • Please add some more details to make your answer more useful for readers. – Mohit Jain Aug 08 '14 at 04:47
  • Done. Also added a test to compare its performance with Regex solution. – orad Aug 08 '14 at 18:50
  • Somewhat faster pattern due to less backtracking with the same functionality if one uses `[\r\n]{1,2}` – ΩmegaMan Feb 27 '15 at 17:23
  • @OmegaMan That has some different behavior. It will match `\n\r` or `\n\n` as single line-break which is not correct. – orad Feb 27 '15 at 22:13
  • @orad I won't argue with you, but if the data has line feeds in multiple numbers...there most likely is something wrong with the data; let us call it an edge case. – ΩmegaMan Feb 28 '15 at 01:02
  • 3
    @OmegaMan How is `Hello\n\nworld\n\n` an edge case? It is clearly one line with text, followed by an empty line, followed by another line with text, followed by an empty line. – Brandin Aug 09 '15 at 10:59
37

You could use Regex.Split:

string[] tokens = Regex.Split(input, @"\r?\n|\r");

Edit: added |\r to account for (older) Mac line terminators.

Wolf
Bart Kiers
11

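If you want to keep empty lines, just remove the StringSplitOptions argument. Be aware that this splits on each character of Environment.NewLine separately, so on Windows every "\r\n" produces an extra empty entry.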

var result = input.Split(System.Environment.NewLine.ToCharArray());
Jonas Elfström
7
string[] lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
Simon Mattes
MAG TOR
5

I had this other answer, but this one, based on Jack's answer, might be preferred since it yields lines lazily (one at a time, as the sequence is enumerated), although it is slightly slower when all lines are consumed.

public static class StringExtensionMethods
{
    public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
                {
                    continue;
                }
                yield return line;
            }
        }
    }
}

Usage:

input.GetLines()      // keeps empty lines

input.GetLines(true)  // removes empty lines

Test:

// Requires: using System.Diagnostics;
Action<Action> measure = (Action func) =>
{
    var stopwatch = Stopwatch.StartNew();   // better suited to timing than DateTime.Now
    for (int i = 0; i < 100000; i++)
    {
        func();
    }
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
};

var input = "";
for (int i = 0; i < 100; i++)
{
    input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
}

measure(() =>
    input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
);

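// Note: the sequence is never enumerated here, so no lines are actually read
// (deferred execution); this timing only measures creating the iterator.
measure(() =>
    input.GetLines()
);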

measure(() =>
    input.GetLines().ToList()
);

Output:

00:00:03.9603894

00:00:00.0029996

00:00:04.8221971

orad
  • 2
    I do wonder if this is because you aren't actually inspecting the results of the enumerator, and therefore it isn't getting executed. Unfortunately, I'm too lazy to check. – James Holwell Oct 19 '17 at 16:54
  • Yes, it actually is!! When you add .ToList() to both the calls, the StringReader solution is actually slower! On my machine it is 6.74s vs. 5.10s – JCH2k Nov 02 '17 at 12:20
  • That makes sense. I still prefer this method because it lets me get lines asynchronously. – orad Nov 06 '17 at 04:41
  • Maybe you should remove the "better solution" header on your other answer and edit this one... – JCH2k Nov 06 '17 at 09:22
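
The near-zero timing for the un-enumerated call is consistent with deferred execution of iterator methods. A minimal sketch, assuming a hypothetical counter (LinesRead) added purely to observe when the body runs:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DeferredDemo
{
    // Hypothetical counter, only here to observe when ReadLine is actually called.
    public static int LinesRead;

    public static IEnumerable<string> GetLines(string str)
    {
        using (var sr = new StringReader(str))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                LinesRead++;
                yield return line;
            }
        }
    }

    public static void Main()
    {
        IEnumerable<string> lazy = GetLines("a\nb\nc");
        Console.WriteLine(LinesRead); // 0: nothing has been read yet

        List<string> all = lazy.ToList();
        Console.WriteLine(LinesRead); // 3: the body ran during enumeration
    }
}
```

Calling the method only builds the iterator object; the reader is created and lines are read once something such as ToList or foreach pulls from it.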
2

Slightly twisted, but an iterator block to do it:

public static IEnumerable<string> Lines(this string text)
{
    string newLine = Environment.NewLine;
    int start = 0;   // index at which the current line begins
    int next;
    while ((next = text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
    {
        yield return text.Substring(start, next - start);
        start = next + newLine.Length;   // skip the whole terminator, e.g. both chars of "\r\n"
    }
    yield return text.Substring(start);  // remainder after the last terminator
}

You can then call:

var result = input.Lines().ToArray();
JDunkerley
2
    private string[] GetLines(string text)
    {

        List<string> lines = new List<string>();
        using (MemoryStream ms = new MemoryStream())
        {
            StreamWriter sw = new StreamWriter(ms);
            sw.Write(text);
            sw.Flush();

            ms.Position = 0;

            string line;

            using (StreamReader sr = new StreamReader(ms))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
            sw.Close();
        }



        return lines.ToArray();
    }
John Thompson
  • This worked really well for parsing a custom file format I wrote. Your code is much faster reading 500+ lines compared to string.Split - big difference! Thanks! – WLFree Oct 07 '22 at 20:25
2

It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF \u000D\u000A, so this combination should only emit a single line. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:

  • each and every CR or LF character should skip to the next line EXCEPT...
  • ...if a CR is immediately followed by LF (\u000D\u000A) then these two together skip just one line.
  • String.Empty is the only input that returns no lines (any character entails at least one line)
  • The last line must be returned even if it has neither CR nor LF.

The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line-breaking function that implements these rules to correctly handle any arbitrary sequence or combination of CR/LF. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.

/// <summary>
/// Enumerates the text lines from the string.
///   ⁃ Mixed CR-LF scenarios are handled correctly
///   ⁃ String.Empty is returned for each empty line
///   ⁃ No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
    int j = 0, c, i;
    char ch;
    if ((c = s.Length) > 0)
        do
        {
            for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                ;

            yield return s.Substring(i, j - i);
        }
        while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}

Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.

public static IEnumerable<String> Lines(this String s)
{
    using (var tr = new StringReader(s))
        while (tr.ReadLine() is String L)
            yield return L;
}
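
As a sanity check of the rules listed above, the two functions can be compared on a few small inputs. This is only a sketch: LineRulesCheck, LinesManual, LinesViaReader and Agree are names chosen for this illustration, and the two method bodies are copies of the functions shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineRulesCheck
{
    // Hand-written scanner: same logic as the first function above.
    public static IEnumerable<string> LinesManual(string s)
    {
        int j = 0, c, i;
        char ch;
        if ((c = s.Length) > 0)
            do
            {
                // Advance j to the next CR/LF or to the end of the string.
                for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                    ;

                yield return s.Substring(i, j - i);
            }
            // Step past the terminator; a CR followed by LF counts as one break.
            while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
    }

    // StringReader-based variant: same logic as the second function above.
    public static IEnumerable<string> LinesViaReader(string s)
    {
        using (var tr = new StringReader(s))
            while (tr.ReadLine() is string line)
                yield return line;
    }

    // True when both variants return the same sequence of lines.
    public static bool Agree(string s) =>
        LinesManual(s).SequenceEqual(LinesViaReader(s));

    public static void Main()
    {
        string[] samples = { "", "a", "a\n", "\n", "a\r", "a\r\nb", "a\n\rb", "a\r\n\r\nb" };
        foreach (string sample in samples)
            Console.WriteLine(Agree(sample));
    }
}
```

The samples cover the individual rules: empty input, a last line without a terminator, a lone CR, a CR-LF pair, and an LF followed by a CR (which counts as two breaks).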
Glenn Slayden
2

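Split a string into lines without any allocation. Note that, as written, each returned span still includes its trailing line terminator ("\n" or "\r\n"), and a lone "\r" is not treated as a line break.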

public static LineEnumerator GetLines(this string text) {
    return new LineEnumerator( text.AsSpan() );
}

internal ref struct LineEnumerator {

    private ReadOnlySpan<char> Text { get; set; }
    public ReadOnlySpan<char> Current { get; private set; }

    public LineEnumerator(ReadOnlySpan<char> text) {
        Text = text;
        Current = default;
    }

    public LineEnumerator GetEnumerator() {
        return this;
    }

    public bool MoveNext() {
        if (Text.IsEmpty) return false;

        var index = Text.IndexOf( '\n' ); // \r\n or \n
        if (index != -1) {
            Current = Text.Slice( 0, index + 1 );
            Text = Text.Slice( index + 1 );
            return true;
        } else {
            Current = Text;
            Text = ReadOnlySpan<char>.Empty;
            return true;
        }
    }


}
Denis535
2

Late to the party, but I've been using a simple collection of extension methods for just that, which leverage TextReader.ReadLine():

public static class StringReadLinesExtension
{
    public static IEnumerable<string> GetLines(this string text) => GetLines(new StringReader(text));
    public static IEnumerable<string> GetLines(this Stream stm) => GetLines(new StreamReader(stm));
    public static IEnumerable<string> GetLines(this TextReader reader) {
        // The using block disposes the reader even if enumeration stops early.
        using (reader) {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}

Using the code is really trivial:

// If you have the text as a string...
var text = "Line 1\r\nLine 2\r\nLine 3";
foreach (var line in text.GetLines())
    Console.WriteLine(line);
// You can also use streams like
var fileStm = File.OpenRead(@"c:\tests\file.txt"); // verbatim string, so \t and \f are not read as escapes
foreach(var line in fileStm.GetLines())
    Console.WriteLine(line);

Hope this helps someone out there.

Loudenvier