34

I have a data stream that may contain \r, \n, \r\n, \n\r or any combination of them. Is there a simple way to normalize the data to make all of them simply become \r\n pairs to make display more consistent?

So something that would yield this kind of translation table:

\r     --> \r\n
\n     --> \r\n
\n\n   --> \r\n\r\n
\n\r   --> \r\n
\r\n   --> \r\n
\r\n\n --> \r\n\r\n
Chris
ctacke
  • 2
    Wait, so you want \n\r to map to \r\n? That's not normalization. No common platform uses \n\r as a line ending. – Derek Park Sep 26 '08 at 18:19
  • 5
    Didn't say it was a platform norm, now did I? I've seen data (from VB code specifically) that has it that way, and I need to account for it. Sorry if that doesn't meet the strict definition of "normalize", but it certainly meets the definition of the data I need to process, which is the point. – ctacke Oct 29 '08 at 03:19

8 Answers

43

I believe this will do what you need:

using System.Text.RegularExpressions;
// ...
string normalized = Regex.Replace(originalString, @"\r\n|\n\r|\n|\r", "\r\n");

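I'm not 100% sure on the exact syntax, and I don't have a .NET compiler handy to check. I wrote it in Perl and converted it into (hopefully correct) C#. The only real trick is to match "\r\n" and "\n\r" first: alternatives are tried left to right, so the two-character pairs have to come before the single characters.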
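The alternation order can be checked against the question's table. This is a minimal sketch, assuming a console program; `NewlineTableCheck` and `Normalize` are hypothetical names, and the regex is the one from this answer:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineTableCheck
{
    // Hypothetical wrapper around the regex from the answer.
    public static string Normalize(string s) =>
        Regex.Replace(s, @"\r\n|\n\r|\n|\r", "\r\n");

    public static void Main()
    {
        // Left-hand sides of the question's table with the results it asks for.
        var cases = new (string Input, string Expected)[]
        {
            ("\r", "\r\n"),
            ("\n", "\r\n"),
            ("\n\n", "\r\n\r\n"),
            ("\n\r", "\r\n"),
            ("\r\n", "\r\n"),
            ("\r\n\n", "\r\n\r\n"),
        };

        foreach (var (input, expected) in cases)
        {
            // Report whether the regex reproduces the table row.
            Console.WriteLine(Normalize(input) == expected ? "ok" : "MISMATCH");
        }
    }
}
```

Every row should report ok if the C# translation behaves like the Perl original.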

To apply it to an entire stream, run it on chunks of input, taking care that a "\r\n" or "\n\r" pair split across a chunk boundary is not converted twice. (You could do this with a stream wrapper if you want.)
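A minimal sketch of chunk-wise processing that remembers a trailing \r or \n between calls, so a pair split across two chunks still yields a single \r\n. It uses a small state machine rather than the regex above; `ChunkedNewlineNormalizer` and `Process` are hypothetical names, not part of any library:

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps state between chunks so split pairs are handled once.
public sealed class ChunkedNewlineNormalizer
{
    // The lone '\r' or '\n' that was just converted, or '\0' if none is pending.
    private char pending = '\0';

    public string Process(string chunk)
    {
        var output = new StringBuilder(chunk.Length + 16);
        foreach (char c in chunk)
        {
            if (c != '\r' && c != '\n')
            {
                // Ordinary character: copy it and forget any pending line ending.
                output.Append(c);
                pending = '\0';
            }
            else if (pending != '\0' && c != pending)
            {
                // Second half of a "\r\n" or "\n\r" pair: "\r\n" was already emitted.
                pending = '\0';
            }
            else
            {
                // Start of a new line ending: emit "\r\n" and remember which char began it.
                output.Append("\r\n");
                pending = c;
            }
        }
        return output.ToString();
    }
}
```

For example, calling `Process("a\r")` and then `Process("\nb")` on the same instance and concatenating the results gives "a\r\nb", with only one line break.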


The original Perl:

$str =~ s/\r\n|\n\r|\n|\r/\r\n/g;

The test results:

[bash$] ./test.pl
\r -> \r\n
\n -> \r\n
\n\n -> \r\n\r\n
\n\r -> \r\n
\r\n -> \r\n
\r\n\n -> \r\n\r\n

Update: Now converts \n\r to \r\n, though I wouldn't call that normalization.

Petter Hesselberg
Derek Park
  • This did not meet the requirements of the above example in the table.. Look at the regex I modified, you need to account for \n\n. – Quintin Robinson Sep 26 '08 at 18:16
  • This one is close, but \n\r should simply swap the elements to be a \r\n (saw this input from a VB developer's code) – ctacke Sep 26 '08 at 18:18
  • Ok, made that change. I wouldn't consider that normalization, but it's easy enough to add to the regex. – Derek Park Sep 26 '08 at 18:22
  • 4
    You will need to remove the '@' from the replacement string. If you don't it will replace '\r\n' with '\\r\\n' because you are asking for the literal string "\r\n". Even better would be to replace with the Environment.NewLine constant. – NerdFury Sep 26 '08 at 18:33
  • 1
    Thanks for catching that, NerdFury. I removed the @ from the replacement string. I would change it to the NewLine constant, but since he specifically asked for "\r\n", I figure I should leave that alone. – Derek Park Sep 26 '08 at 18:42
  • What about performance with Regex? Maybe use a Regex timeout (new in .NET 4.5), RegexMatchTimeoutException, etc. – Kiquenet Jun 19 '13 at 11:46
  • There is a better way now, see my other answer https://stackoverflow.com/a/75133341/475727 – Liero Jan 16 '23 at 15:09
15

I'm with Jamie Zawinski on RegEx:

"Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems."

For those of us who prefer readability:

  • Step 1

    Replace \r\n by \n

    Replace \n\r by \n (if you really want this, some posters seem to think not)

    Replace \r by \n

  • Step 2

    Replace \n by Environment.NewLine, \r\n, or whatever you need.

Joe
9

Since .NET 6, this is supported out of the box:

string normalized = originalString.ReplaceLineEndings(); //uses Environment.NewLine

string normalized = originalString.ReplaceLineEndings("\r\n");

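See https://github.com/dotnet/runtime/blob/a879885975b5498db559729811304888463c15ed/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs#L1183 for the implementation. Note that ReplaceLineEndings treats \n\r as two separate line endings, so that sequence becomes \r\n\r\n rather than the single \r\n in the question's table.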

InvertedAcceleration
Liero
3

A regex would help. You could do something roughly like this:

(\r\n|\n\n|\n\r|\r|\n) replace with \r\n

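This regex produced the following matches from the left-hand side of the posted table, so a replace should normalize most of the cases. Note that \n\n is matched as a single unit, so a replace would collapse it to one \r\n rather than the \r\n\r\n the table asks for; drop the \n\n alternative to get that result.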

\r   => \r 
\n   => \n 
\n\n => \n\n 
\n\r => \n\r 
\r\n => \r\n 
\r\n => \r\n 
\n   => \n 
Quintin Robinson
3

Normalise breaks, so that they are all \r\n

var normalisedString =
            sourceString
            .Replace("\r\n", "\n")
            .Replace("\n\r", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "\r\n");
Draken
Phil
3

It's a two-step process.
First, convert every combination of \r and \n into a single character, say \r.
Then convert every \r into your target \r\n.

normalized = 
    original.Replace("\r\n", "\r").
             Replace("\n\r", "\r").
             Replace("\n", "\r").
             Replace("\r", "\r\n"); // last step
GDavoli
2

You're overcomplicating this. Ignore every \r and turn every \n into \r\n.

In Pseudo-C#:

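char[] chunk = new char[X];                     // X: chunk size of your choice
StringBuilder output = new StringBuilder();     // needs using System.Text;

// buffer is assumed to be a TextReader over the input
int read = buffer.Read(chunk, 0, chunk.Length);
for (int i = 0; i < read; i++)                  // only look at the chars actually read
{
   char c = chunk[i];
   switch (c)
   {
      case '\r' : break;                        // ignore
      case '\n' : output.Append("\r\n"); break; // expand to CRLF
      default   : output.Append(c); break;      // copy everything else
   }
}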

EDIT: \r alone is not a line terminator, so I doubt you really want to expand \r to \r\n.

VVS
  • 1
    He wants standalone \r to turn into \r\n as well. – Derek Park Sep 26 '08 at 18:05
  • Hm. Can't believe he really wants that :) – VVS Sep 26 '08 at 18:09
  • 5
    Macs used CR for linebreaks up to MacOS 9. It's \n\r that surprises me. – Steve Jessop Sep 26 '08 at 18:32
  • Pre-MacOS X Macs and some 8-bit systems back in the '80s used CR. MacOS X uses LF like any other Unix system – Ken Keenan Nov 13 '17 at 09:54
  • Just to point out that we are porting various codes to various languages and platforms. Codes are written by hundreds of humans, some of them may be dead now. And the amount of typos "\n\r" is just... incredibly high... It's an edge case yes, not everybody uses logic from other languages involving mass string data.. I guess, but the "\n\r" check in our case is 1000% worth it. – Karl Stephen Jun 15 '21 at 11:37
0

This answers the question as asked: the solution below converts a string according to the given translation table. It does not use an expensive regex function, and it does not chain multiple Replace calls, each of which would loop over the data separately with its own checks.

The search is done in a single for loop. The only other loop is the one inside Array.Copy, which runs each time the capacity of the result array has to be increased. In some cases, a larger page size might be more efficient.

public static string NormalizeNewLine(this string val)
{
    if (string.IsNullOrEmpty(val))
        return val;

    const int page = 6;
    int a = page;
    int j = 0;
    int len = val.Length;
    char[] res = new char[len];

    for (int i = 0; i < len; i++)
    {
        char ch = val[i];

        if (ch == '\r')
        {
            int ni = i + 1;
            if (ni < len && val[ni] == '\n')
            {
                res[j++] = '\r';
                res[j++] = '\n';
                i++;
            }
            else
            {
                if (a == page) //ensure capacity
                {
                    char[] nres = new char[res.Length + page];
                    Array.Copy(res, 0, nres, 0, res.Length);
                    res = nres;
                    a = 0;
                }

                res[j++] = '\r';
                res[j++] = '\n';
                a++;
            }
        }
        else if (ch == '\n')
        {
            int ni = i + 1;
            if (ni < len && val[ni] == '\r')
            {
                res[j++] = '\r';
                res[j++] = '\n';
                i++;
            }
            else
            {
                if (a == page) //ensure capacity
                {
                    char[] nres = new char[res.Length + page];
                    Array.Copy(res, 0, nres, 0, res.Length);
                    res = nres;
                    a = 0;
                }

                res[j++] = '\r';
                res[j++] = '\n';
                a++;
            }
        }
        else
        {
            res[j++] = ch;
        }
    }

    return new string(res, 0, j);
}

The translation table really appeals to me, even if '\n\r' is not actually used on mainstream platforms. Who would use two types of line break to indicate two line breaks? If you want to know that, you need to check beforehand whether \n and \r are both used separately in the same document.

Roberto B
  • This array copying to resize it has the potential to create a lot of garbage. – CodeCaster May 15 '18 at 07:58
  • This code is based on stringbuilder Replace function. Source: https://referencesource.microsoft.com/#mscorlib/system/text/stringbuilder.cs Ensure capacity is also based on Capacity property of List. Source: https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs – Roberto B May 15 '18 at 08:16
  • That is an awful lot of code to replace a very simple regex. Not sure why you assume the regex would be "expensive", the cases in which a regex is slower than code you would write yourself are pretty rare. – mhenry1384 Jul 12 '18 at 12:51
  • 1
    An awful lot of code... Maybe not. Have you ever looked at a regex compilation? You can do this with Regex.CompileToAssembly(... Read: https://blog.maartenballiauw.be/post/2017/04/24/making-string-validation-faster-no-regular-expressions.html This seems to me as a frequently called function and then is good to go for performance. – Roberto B Jul 16 '18 at 12:27