Normalize newlines in C#

Question

I have a data stream that may contain \r, \n, \r\n, \n\r or any combination of them. Is there a simple way to normalize the data to make all of them simply become \r\n pairs to make display more consistent?

So something that would yield this kind of translation table:

\r     --> \r\n
\n     --> \r\n
\n\n   --> \r\n\r\n
\n\r   --> \r\n
\r\n   --> \r\n
\r\n\n --> \r\n\r\n

Wait, so you want \n\r to map to \r\n? That's not normalization. No common platform uses \n\r as a line ending. — Derek Park, Sep 26 '08 at 18:19
Didn't say it was a platform norm, now did I? I've seen data (from VB code specifically) that has it that way, and I need to account for it. Sorry if that doesn't meet the strict definition of "normalize", but it certainly meets the definition of the data I need to process, which is the point. — ctacke, Oct 29 '08 at 03:19

score 43 · Accepted Answer · edited Jan 16 '23 at 12:15

I believe this will do what you need:

using System.Text.RegularExpressions;
// ...
string normalized = Regex.Replace(originalString, @"\r\n|\n\r|\n|\r", "\r\n");

I'm not 100% sure on the exact syntax, and I don't have a .NET compiler handy to check. I wrote it in Perl and converted it into (hopefully correct) C#. The only real trick is to match "\r\n" and "\n\r" first.

To apply it to an entire stream, run it on chunks of input, making sure that a pair such as \r\n is not split across two chunks. (You could do this with a stream wrapper if you want.)
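One way to sidestep chunk boundaries altogether is a character-by-character pass that remembers the previous newline character. The following is only a sketch under that assumption, not the answer's own method; the class and method names (NewlineStream, Normalize) are hypothetical. It pairs characters left to right, as the regex alternation does.

```csharp
using System.IO;

public static class NewlineStream
{
    // Hypothetical helper: copies input to output, writing \r\n for every
    // \r, \n, \r\n or \n\r it encounters.
    public static void Normalize(TextReader input, TextWriter output)
    {
        int pending = -1; // newline char that may still become a pair, or -1
        int c;
        while ((c = input.Read()) != -1)
        {
            if (c == '\r' || c == '\n')
            {
                if (pending != -1 && pending != c)
                {
                    // Second half of a \r\n or \n\r pair; \r\n was already written.
                    pending = -1;
                }
                else
                {
                    output.Write("\r\n");
                    pending = c;
                }
            }
            else
            {
                output.Write((char)c);
                pending = -1;
            }
        }
    }
}
```

It could be called with any reader and writer, for example a StreamReader over the incoming data and a StringWriter for the result. Because the state is a single character, it does not matter where the underlying stream splits its reads.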

The original Perl:

$str =~ s/\r\n|\n\r|\n|\r/\r\n/g;

The test results:

[bash$] ./test.pl
\r -> \r\n
\n -> \r\n
\n\n -> \r\n\r\n
\n\r -> \r\n
\r\n -> \r\n
\r\n\n -> \r\n\r\n

Update: Now converts \n\r to \r\n, though I wouldn't call that normalization.

answered Sep 26 '08 at 18:14 by Derek Park · edited Jan 16 '23 at 12:15 by Petter Hesselberg

This did not meet the requirements of the above example in the table.. Look at the regex I modified, you need to account for \n\n. – Quintin Robinson Sep 26 '08 at 18:16
This one is close, but \n\r should simply swap the elements to be a \r\n (saw this input from a VB developer's code) – ctacke Sep 26 '08 at 18:18
Ok, made that change. I wouldn't consider that normalization, but it's easy enough to add to the regex. – Derek Park Sep 26 '08 at 18:22
You will need to remove the '@' from the replacement string. If you don't it will replace '\r\n' with '\\r\\n' because you are asking for the literal string "\r\n". Even better would be to replace with the Environment.NewLine constant. – NerdFury Sep 26 '08 at 18:33
Thanks for catching that, NerdFury. I removed the @ from the replacement string. I would change it to the NewLine constant, but since he specifically asked for "\r\n", I figure I should leave that alone. – Derek Park Sep 26 '08 at 18:42
What about performance with Regex? Maybe use the Regex timeout (new in .NET 4.5), RegexMatchTimeoutException, etc. – Kiquenet Jun 19 '13 at 11:46
There is a better way now, see my other answer https://stackoverflow.com/a/75133341/475727 – Liero Jan 16 '23 at 15:09

Joe · Answer 2 · 2014-04-16T08:48:06.087

15

I'm with Jamie Zawinski on RegEx:

"Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems"

For those of us who prefer readability:

Step 1:

Replace \r\n by \n
Replace \n\r by \n (if you really want this; some posters seem to think not)
Replace \r by \n

Step 2:

Replace \n by Environment.NewLine or \r\n or whatever.
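The two steps can be sketched as a small helper that takes the target line ending as a parameter. This is an illustration only; the names NewlineSteps and Normalize are hypothetical, and the order of the first three calls matters because the pairs must be collapsed before any lone \r.

```csharp
public static class NewlineSteps
{
    // Hypothetical helper: target is the line ending to emit,
    // e.g. "\r\n" or Environment.NewLine.
    public static string Normalize(string text, string target)
    {
        return text
            .Replace("\r\n", "\n")   // step 1: collapse pairs first...
            .Replace("\n\r", "\n")
            .Replace("\r", "\n")     // ...then any remaining lone \r
            .Replace("\n", target);  // step 2: expand to the target
    }
}
```

Note that sequential passes can pair characters differently from a single left-to-right scan: the input "\r\n\r" ends up as one line ending here, whereas a regex alternation that matches \r\n first produces two.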

answered Sep 26 '08 at 21:47 by Joe · edited Apr 16 '14 at 08:48

This is a trivial regex. I would agree with you if it were HTML parsing. – cchamberlain Aug 11 '15 at 23:38
@cchamberlain you mean https://stackoverflow.com/a/1732454/461444 ? :D – AFract Mar 16 '21 at 16:46
@AFract yes. : – cchamberlain Mar 23 '21 at 19:43

score 9 · Answer 3 · edited Jul 24 '23 at 12:56

Since .NET 6, this is supported out of the box:

string normalized = originalString.ReplaceLineEndings(); //uses Environment.NewLine

string normalized = originalString.ReplaceLineEndings("\r\n");

See https://github.com/dotnet/runtime/blob/a879885975b5498db559729811304888463c15ed/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs#L1183
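One caveat when comparing this with the translation table in the question: ReplaceLineEndings treats \r\n as a single newline sequence but does not treat \n\r as one, so \n\r yields two line endings. It also recognizes further Unicode newline characters such as U+0085 and U+2028. A minimal sketch, assuming .NET 6 or later (the wrapper class name is hypothetical):

```csharp
using System;

public static class ReplaceLineEndingsDemo
{
    // Thin wrapper around the built-in method; requires .NET 6 or later.
    public static string Normalize(string text)
    {
        return text.ReplaceLineEndings("\r\n");
    }
}
```

With this wrapper, "a\rb\nc\r\nd" becomes "a\r\nb\r\nc\r\nd", while "a\n\rb" becomes "a\r\n\r\nb". If \n\r must count as a single line ending, one of the other approaches in this thread is needed instead of, or in front of, this call.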

answered Jan 16 '23 at 11:05 by Liero · edited Jul 24 '23 at 12:56 by InvertedAcceleration

Quintin Robinson · Answer 4 · 2008-09-26T18:02:14.637

3

A regex would help. You could do something roughly like this:

(\r\n|\n\n|\n\r|\r|\n) replace with \r\n

This regex produced these results from the table posted (just testing left side) so a replace should normalize.

\r   => \r 
\n   => \n 
\n\n => \n\n 
\n\r => \n\r 
\r\n => \r\n 
\r\n => \r\n 
\n   => \n

answered Sep 26 '08 at 17:53 by Quintin Robinson · edited Sep 26 '08 at 18:02

Except if it contains \r\n already, the replacement would expand that to \r\n\r\n. Same for \n\r. I believe the answer is in the arcane language of regex, but it's a black art to me. – ctacke Sep 26 '08 at 17:56
CQ, that doesn't do what he asked for. A regex might work, but not as you've posted it. – Derek Park Sep 26 '08 at 17:56
Agreed, I did not account for existing \r\n. – Quintin Robinson Sep 26 '08 at 17:57
That is why I said roughly though; a little tweaking like preceding an \r\n might resolve this. – Quintin Robinson Sep 26 '08 at 17:59

score 3 · Answer 5 · edited Jan 17 '17 at 13:43

Normalise breaks, so that they are all \r\n

var normalisedString =
            sourceString
            .Replace("\r\n", "\n")
            .Replace("\n\r", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "\r\n");

answered Jan 17 '17 at 12:13 by Phil · edited Jan 17 '17 at 13:43 by Draken

score 3 · Answer 6 · answered Oct 11 '20 at 03:23

It's a two-step process.
First, you convert all the combinations of \r and \n into a single character, say \r.
Then you convert every \r into your target \r\n.

normalized = 
    original.Replace("\r\n", "\r").
             Replace("\n\r", "\r").
             Replace("\n", "\r").
             Replace("\r", "\r\n"); // last step

VVS · Answer 7 · 2008-09-26T18:07:59.490

2

You're overcomplicating this. Ignore every \r and turn every \n into an \r\n.

In Pseudo-C#:

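char[] chunk = new char[X];                 // X: chunk size, chosen by the caller
StringBuilder output = new StringBuilder(); // requires using System.Text;

// buffer is assumed to be a TextReader over the input
int count = buffer.Read(chunk, 0, chunk.Length);
for (int i = 0; i < count; i++)
{
   switch (chunk[i])
   {
      case '\r': break;                         // ignore
      case '\n': output.Append("\r\n"); break;  // C# does not allow fall-through
      default:   output.Append(chunk[i]); break;
   }
}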

EDIT: \r alone is not a line terminator, so I doubt you really want to expand \r to \r\n.

answered Sep 26 '08 at 18:02 by VVS · edited Sep 26 '08 at 18:07

He wants standalone \r to turn into \r\n as well. – Derek Park Sep 26 '08 at 18:05
Hm. Can't believe he really wants that :) – VVS Sep 26 '08 at 18:09
Macs used CR for linebreaks up to MacOS 9. It's \n\r that surprises me. – Steve Jessop Sep 26 '08 at 18:32
Pre-MacOS X Macs and some 8-bit systems back in the '80s used CR. MacOS X uses LF like any other Unix system – Ken Keenan Nov 13 '17 at 09:54
Just to point out that we are porting various codes to various languages and platforms. Codes are written by hundreds of humans, some of them may be dead now. And the amount of typos "\n\r" is just... incredibly high... It's an edge case yes, not everybody uses logic from other languages involving mass string data.. I guess, but the "\n\r" check in our case is 1000% worth it. – Karl Stephen Jun 15 '21 at 11:37

Roberto B · Answer 8 · 2018-05-24T18:43:42.807

This answers the question as asked: the solution below transforms a string according to the given translation table. It does not use an expensive regex function, and it does not chain multiple Replace calls that each loop over the data with their own checks.

The search is done in a single for loop. Each time the capacity of the result array has to be increased, Array.Copy performs one more loop internally. Those are all the loops. In some cases, a larger page size might be more efficient.

public static string NormalizeNewLine(this string val)
{
    if (string.IsNullOrEmpty(val))
        return val;

    const int page = 6;
    int a = page;
    int j = 0;
    int len = val.Length;
    char[] res = new char[len];

    for (int i = 0; i < len; i++)
    {
        char ch = val[i];

        if (ch == '\r')
        {
            int ni = i + 1;
            if (ni < len && val[ni] == '\n')
            {
                res[j++] = '\r';
                res[j++] = '\n';
                i++;
            }
            else
            {
                if (a == page) //ensure capacity
                {
                    char[] nres = new char[res.Length + page];
                    Array.Copy(res, 0, nres, 0, res.Length);
                    res = nres;
                    a = 0;
                }

                res[j++] = '\r';
                res[j++] = '\n';
                a++;
            }
        }
        else if (ch == '\n')
        {
            int ni = i + 1;
            if (ni < len && val[ni] == '\r')
            {
                res[j++] = '\r';
                res[j++] = '\n';
                i++;
            }
            else
            {
                if (a == page) //ensure capacity
                {
                    char[] nres = new char[res.Length + page];
                    Array.Copy(res, 0, nres, 0, res.Length);
                    res = nres;
                    a = 0;
                }

                res[j++] = '\r';
                res[j++] = '\n';
                a++;
            }
        }
        else
        {
            res[j++] = ch;
        }
    }

    return new string(res, 0, j);
}

The translation table really appeals to me, even if '\n\r' is not actually used on common platforms. Who would use two types of line breaks to indicate two line breaks? If you need to know that, you have to check beforehand whether \n and \r are both used separately in the same document.

This array copying to resize it has the potential to create a lot of garbage. — CodeCaster, May 15 '18 at 07:58
This code is based on stringbuilder Replace function. Source: https://referencesource.microsoft.com/#mscorlib/system/text/stringbuilder.cs Ensure capacity is also based on Capacity property of List. Source: https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs — Roberto B, May 15 '18 at 08:16
That is an awful lot of code to replace a very simple regex. Not sure why you assume the regex would be "expensive", the cases in which a regex is slower than code you would write yourself are pretty rare. — mhenry1384, Jul 12 '18 at 12:51
An awful lot of code... Maybe not. Have you ever looked at a regex compilation? You can do this with Regex.CompileToAssembly(... Read: https://blog.maartenballiauw.be/post/2017/04/24/making-string-validation-faster-no-regular-expressions.html This looks to me like a frequently called function, so it is worth optimizing for performance. — Roberto B, Jul 16 '18 at 12:27
