294

While writing this answer, I had to match exclusively on linebreaks instead of using the s-flag (dotall - dot matches linebreaks).

The sites usually used to test regular expressions behave differently when trying to match on \n or \r\n.

I noticed

  • Regex101 matches linebreaks only on \n
    (example - delete \r and it matches)

  • RegExr matches linebreaks neither on \n nor on \r\n
    and I can't find anything that makes it match a linebreak, except for the m-flag and \s
    (example)

  • Debuggex behaves even more differently:
    in this example it matches only on \r\n, while
    here it only matches on \n, with the same flags and engine specified

I'm fully aware of the m-flag (multiline - makes ^ match the start and $ the end of a line), but sometimes this is not an option. Same with \s, as it matches tabs and spaces, too.

My idea of using the Unicode newline character (\u0085) wasn't successful, so:

  1. Is there a failsafe way to integrate the match on a linebreak (preferably regardless of the language used) into a regular expression?
  2. Why do the above-mentioned sites behave differently (especially Debuggex, which matches once only on \n and once only on \r\n)?
Community
KeyNone
  • 38
    You can try `[\r\n]+` - or something like this – Iłya Bursov Nov 18 '13 at 19:45
  • 9
    I use: `\r?\n` to match both `\r\n` and `\n` line termination sequences. It doesn't work for the old `\r` Mac syntax, but that one is pretty rare these days. – ridgerunner Nov 18 '13 at 20:14
  • 7
    Hey there, I'm the founder of debuggex. This looks like a bug (for debuggex, I can't speak for the others). I've added a high-pri issue referencing this question. We'll get to it as soon as possible - we're currently focusing all of our (very limited) resources on launching another product. – Sergiu Toarca Nov 18 '13 at 22:03
  • 4
    @ridgerunner to add Mac's syntax to that, you could do (\r?\n|\r), which is similar to Peter van der Wal's answer below but more compact (10 chars vs 12 chars). – Doktor J Jul 03 '15 at 16:41

7 Answers

385

I will answer in the opposite direction.

  2. For a full explanation of \r and \n, I have to refer to this question, which is far more complete than what I will post here: Difference between \n and \r?

Long story short: Linux uses \n for a newline, Windows uses \r\n, and old Macs use \r. So there are multiple ways to write a newline. Your second tool (RegExr), for example, does match on a single \r.

  1. [\r\n]+ as Ilya suggested will work, but it will also match multiple consecutive newlines as a single match. (\r\n|\r|\n) is more correct, because it matches exactly one line break.
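
To make the difference between the two patterns concrete, here is a minimal Python sketch (Python is my choice for illustration; the sample text and variable names are made up). It splits the same string with both patterns:

```python
import re

# Sample with a Windows break, an empty line, a Unix break and an old Mac break.
text = "first\r\n\r\nsecond\nthird\rfourth"

# Character class with +: a whole run of \r and \n characters is one separator,
# so the empty line between "first" and "second" disappears.
run_of_breaks = re.split(r"[\r\n]+", text)

# Alternation: exactly one line break per match. \r\n is listed first so that
# a Windows line ending is not counted as two separate breaks.
single_break = re.split(r"\r\n|\r|\n", text)

print(run_of_breaks)
print(single_break)
```

Which one is preferable depends on whether empty lines matter for the task at hand.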
Aryan Beezadhur
Peter van der Wal
  • So, `\r`/`\n` depend on the operating system - that's a thing one may know ( ;) ) - but why do the two debuggex-examples match once on \r\n and once on \n? At least there's no difference (in the examples) visible for me. – KeyNone Nov 18 '13 at 20:08
  • Most likely because you copied one out of your windows text editor and the other one you wrote straight into the debuggex textarea. Each used different line breaks. – OGHaza Nov 18 '13 at 20:18
  • 1
    Indeed, because in your third example (the Senior men's...) there is an `\r\n` in the text (if you right-click and show source, you will find `{{Infobox XC Championships\r\n|Name =` somewhere). The second tool is written in Flash and, as its about page says, is a bit buggy with newline characters. – Peter van der Wal Nov 18 '13 at 20:29
  • 1
    `(\r\n|\r|\n)` can be written more simply as `\r\n?` – Asad Saeeduddin Jun 08 '16 at 17:18
  • 10
    @AsadSaeeduddin No it can't. It won't match the Unix line-ending `\n` – Peter van der Wal Jun 08 '16 at 17:42
  • 1
    Whoops, you're right. I meant to add the `?` to `\r`, which is the optional one. It should be `\r?\n`. – Asad Saeeduddin Jun 08 '16 at 18:42
  • 6
    @AsadSaeeduddin That one won't match Mac's single `\r` – Peter van der Wal Jun 08 '16 at 20:23
  • Ah, didn't realize there were platforms with single `\r`. – Asad Saeeduddin Jun 09 '16 at 00:01
  • 1
    @PetervanderWal *Old Mac's single `\r` – Teejay Sep 04 '20 at 07:53
34

In PCRE, \R matches \n, \r and \r\n (by default it also matches other vertical whitespace such as \v, \f and NEL).
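
For engines without \R, the same idea can be approximated by hand. As far as I know, Python's built-in re module does not accept \R, so the following is only a sketch of an equivalent pattern using the standard library; the name ANY_LINEBREAK and the sample string are hypothetical:

```python
import re

# Hypothetical helper: approximates PCRE's default \R with the stdlib re module.
# \r\n comes first so that CRLF is treated as one break rather than two.
ANY_LINEBREAK = re.compile(r"\r\n|[\n\v\f\r\x85\u2028\u2029]")

# Sample mixing Windows, Unix, old Mac and Unicode line-separator breaks.
sample = "a\r\nb\nc\rd\u2028e"
parts = ANY_LINEBREAK.split(sample)

print(parts)
```

If only \n, \r and \r\n are of interest, the character class can be reduced to [\n\r].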

Toto
Cwazy Paving
  • 5
    @Sandwell: Sorry, I don't get you, this is not a question, it is an answer, simpler than `(\r\n|\r|\n)` – Toto May 13 '20 at 14:49
  • This is slick! Confirmed working on Rails 7.0 and Ruby 3.1 (and whatever REGEX parser/matcher they use in it). – Joshua Pinter Oct 05 '22 at 01:29
15

You have different line endings in the example texts in Debuggex. What is especially interesting is that Debuggex seems to have identified which line ending style you used first, and it converts all additional line endings entered to that style.

I used Notepad++ to paste sample text in Unix and Windows format into Debuggex, and whichever I pasted first is what that session of Debuggex stuck with.

So, you should wash your text through your text editor before pasting it into Debuggex. Ensure that you're pasting the style you want. Debuggex defaults to Unix style (\n).

Also, NEL (\u0085) is something different entirely: https://en.wikipedia.org/wiki/Newline#Unicode

(\r?\n) will cover Unix and Windows. You'll need something more complex, like (\r\n|\r|\n), if you want to match old Mac too.
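
To check which line-ending styles a piece of text actually contains before pasting it into a tester, something like the following Python sketch could be used. The function name count_line_endings and the sample are assumptions for illustration only:

```python
import re
from collections import Counter

def count_line_endings(text):
    """Count CRLF, lone CR and lone LF sequences in text (hypothetical helper)."""
    names = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}
    # \r\n is tried first, so a Windows break is never counted as CR plus LF.
    return Counter(names[m.group()] for m in re.finditer(r"\r\n|\r|\n", text))

mixed = "one\r\ntwo\nthree\r\nfour\rfive"
counts = count_line_endings(mixed)

# \r?\n finds the Windows and Unix breaks but skips the lone \r.
unix_and_windows = re.findall(r"\r?\n", mixed)

print(counts)
print(unix_and_windows)
```

A text that reports more than one style is a likely cause of a pattern matching in one tester session and failing in another.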

Dane
  • Very interesting point about debuggex! Also, thanks for pointing out \u0085, got misled there! – KeyNone Nov 18 '13 at 21:03
3

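In Python:

import re

text = "one\r\ntwo\rthree\nfour"

# as in Peter van der Wal's answer (re.M is not needed: the pattern has no ^ or $)
re.split(r'\r\n|\r|\n', text)

or, more rigorously (this also splits on other line boundaries such as \v, \f and \u2028):

# https://docs.python.org/3/library/stdtypes.html#str.splitlines
text.splitlines()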
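
The two approaches are not interchangeable in every case. The short sketch below, with a made-up sample string, shows two differences: str.splitlines() also treats a form feed as a line boundary, and it does not produce a trailing empty string for a final newline.

```python
import re

# Sample with a Windows break, a form feed (\x0c) and a trailing Unix break.
text = "a\r\nb\x0cc\n"

# The regex only knows \r\n, \r and \n, and keeps the empty tail after the last \n.
by_regex = re.split(r"\r\n|\r|\n", text)

# splitlines() also splits on \x0c and drops the empty tail.
by_method = text.splitlines()

print(by_regex)
print(by_method)
```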
Keelung
3

Not sure if this is what was asked for:

(somethingToStartMatch)(.|\n)*?(somethingToEndMatch)

The pattern has three groups; the one in the middle matches everything between the markers, line breaks included. I tested this with .NET, so it might help someone there.

string pattern = @"(somethingToStartMatch)(.|\n)*?(somethingToEndMatch)";

Note that the lazy quantifier *? stops at the first end marker, so the pattern still works when your text contains multiple keyword pairs!
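
The same idea can be sketched in Python, with START and END as placeholder markers of my own. One adjustment that is mine and not part of the answer above: the repetition is wrapped in an outer group, so that group 2 holds the whole middle text rather than only its last character.

```python
import re

text = "START one\ntwo END junk START three END"

# (?:.|\n)*? lets the middle part cross line breaks without the s-flag.
# The lazy *? stops at the first END instead of running on to the last one.
lazy = re.compile(r"(START)((?:.|\n)*?)(END)")
middles = [m.group(2) for m in lazy.finditer(text)]

# For comparison: the greedy version swallows everything up to the last END.
greedy = re.compile(r"(START)((?:.|\n)*)(END)")
greedy_middle = greedy.search(text).group(2)

print(middles)
print(repr(greedy_middle))
```

A commonly used alternative to (?:.|\n) is the character class [\s\S], which avoids the alternation.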

rufreakde
2

This only applies to question 1.

I have an app that runs on Windows and uses a multi-line MFC editor box.
The editor box expects CRLF linebreaks, but I need to parse the entered text
with some really big/nasty regexes.

I didn't want to be stressing about this while writing the regex, so
I ended up normalizing back and forth between the parser and editor so that
the regexes just use \n. I also trap paste operations and convert them for the boxes.

This does not take much time.
This is what I use.

 boost::regex  CRLFCRtoLF (
     " \\r\\n | \\r(?!\\n) "
     , MODx);

 boost::regex  CRLFCRtoCRLF (
     " \\r\\n?+ | \\n "
     , MODx);


 // Convert (All style) linebreaks to linefeeds 
 // ---------------------------------------
 void ReplaceCRLFCRtoLF( string& strSrc, string& strDest )
 {
    strDest  = boost::regex_replace ( strSrc, CRLFCRtoLF, "\\n" );
 }

 // Convert (All style) linebreaks to CRLF linebreaks (Windows) 
 // ---------------------------------------
 void ReplaceCRLFCRtoCRLF( string& strSrc, string& strDest )
 {
    strDest  = boost::regex_replace ( strSrc, CRLFCRtoCRLF, "\\r\\n" );
 }
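
For readers who do not use Boost, the same normalization approach can be sketched in Python. The function names to_lf and to_crlf are hypothetical and are not taken from the code above:

```python
import re

def to_lf(text):
    """Convert CRLF and lone CR line breaks to LF."""
    return re.sub(r"\r\n|\r", "\n", text)

def to_crlf(text):
    """Convert any of CRLF, lone CR or lone LF to CRLF."""
    # \r\n is matched first, so existing CRLF breaks are not doubled.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

mixed = "a\r\nb\rc\nd"
for_parser = to_lf(mixed)        # the regexes then only need to handle \n
for_editor = to_crlf(for_parser) # back to what the edit box expects

print(repr(for_parser))
print(repr(for_editor))
```

Normalizing once at the boundary keeps every other pattern in the program free of line-ending alternatives.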
0

A bit late to the party, but this may be useful to others. In JavaScript, \s already matches newlines/linebreaks (\r and \n), so a character class containing \s covers them; note that a pipe (|) inside a character class is a literal pipe character, not alternation. In my case I needed to get rid of all the commas, semicolons and whitespace characters (linebreaks included), so I ended up using this:

.split(/[\s,;|]+/)
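
The same character class behaves the same way in Python, which I use here only for illustration; the sample string is made up. The sketch shows that \s alone already covers the line breaks and that | inside the class only adds the literal pipe character as a separator:

```python
import re

raw = "alpha, beta;\ngamma\r\n delta|epsilon"

# \s covers spaces, tabs, \r and \n, so no separate line-break token is needed.
tokens = re.split(r"[\s,;]+", raw)

# Adding | to the class additionally splits on literal pipe characters.
tokens_with_pipe = re.split(r"[\s,;|]+", raw)

print(tokens)
print(tokens_with_pipe)
```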

Kepi