My requirement is to match each line of a text file, including its line terminator, where at most the terminator of the last line may be missing (to take into account the crippled, non-POSIX-compliant files generated on Windows). Each line terminator can be either \n or \r\n.
And I'm looking for the best regex, performance-wise.
The first regex I could come up with is this:
\n|\r\n|[^\r\n]++(\r\n|\n)?
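To make the intended behavior concrete, here is a minimal sketch in Python (the language, the helper name split_lines and the sample text are assumptions for illustration only). Since Python's re accepts the possessive ++ only from version 3.11 on, the sketch swaps in the greedy +, which should produce the same matches for this pattern:

```python
import re

# The regex above, with greedy `+` swapped in for the possessive `++`
# (assumption: `++` is only accepted by Python's re from 3.11 onwards).
LINE_RE = re.compile(r"\n|\r\n|[^\r\n]+(\r\n|\n)?")

def split_lines(text):
    """Return every line of `text`, each with its terminator if it has one."""
    return [m.group(0) for m in LINE_RE.finditer(text)]

sample = "one\r\ntwo\n\nlast"   # mixed terminators, empty line, no final terminator
lines = split_lines(sample)
print(lines)                     # ['one\r\n', 'two\n', '\n', 'last']
print("".join(lines) == sample)  # True: nothing was skipped
```

Joining the matches back together and comparing with the input is a cheap way to check that no character was skipped.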
A few comments on it:
- since the three alternatives cannot match at the same place, I guess the order of the alternatives is irrelevant, regardless of whether the engine is a DFA or an NFA;
- the ++ instead of + should save some memory, but not some time, as backtracking shouldn't occur.
From Code Review, a suggestion was to use .*(\r?\n|$) (or [^\r\n]*(\r?\n|$), if . also matches \n or \r), but this has a flaw: it also matches the empty string at the end of the file.
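The flaw can be reproduced with a short sketch, assuming Python's re module with default flags (the helper name suggested_matches is hypothetical). I use the explicit character class form so that the result does not depend on what . matches:

```python
import re

# Code Review suggestion, in its explicit character-class form.
SUGGESTED_RE = re.compile(r"[^\r\n]*(\r?\n|$)")

def suggested_matches(text):
    """Return the full text of every match found by the suggested regex."""
    return [m.group(0) for m in SUGGESTED_RE.finditer(text)]

# In both cases a spurious empty match shows up at the very end of the input.
print(suggested_matches("a\nb\n"))  # ['a\n', 'b\n', '']
print(suggested_matches("a\nb"))    # ['a\n', 'b', '']
```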
That suggested regex can be improved like this:
(?=.).*(\r?\n)?
where the lookahead guarantees that .* and (\r?\n)? together match at least one character, which prevents the empty string at the end of the file from matching.
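A quick check of this variant, again as a sketch that assumes Python's re with default flags (so . does not match \n); the helper name lookahead_matches is hypothetical. Under that assumption the trailing empty match is gone, but an empty line terminated by a bare \n appears to be skipped as well, because the lookahead cannot succeed on the \n itself:

```python
import re

# Lookahead variant, compiled with default flags (`.` does not match "\n").
LOOKAHEAD_RE = re.compile(r"(?=.).*(\r?\n)?")

def lookahead_matches(text):
    """Return the full text of every match found by the lookahead variant."""
    return [m.group(0) for m in LOOKAHEAD_RE.finditer(text)]

print(lookahead_matches("a\nb\n"))  # ['a\n', 'b\n']  (no empty match at the end)
print(lookahead_matches("a\n\nb"))  # ['a\n', 'b']    (the empty line is missing)
```

Whether the same happens in another engine depends on what . matches there, so this is only evidence for the flavor assumed here.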
Which of the two regexes above should be better, performance-wise? Is there any other better way to match as per my requirements?
Please, if you use the ^/$ anchors or similar, comment on that, because their behavior depends on whether the engine treats them as multiline by default.
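As an illustration of why the anchors need a comment, here is a small sketch of how $ behaves in one particular flavor, Python's re (chosen only as an example; other engines may differ). By default it matches at the end of the string and just before a final newline; with re.MULTILINE it also matches before every other newline:

```python
import re

text = "a\nb\n"

# Positions at which `$` matches, without and with the MULTILINE flag.
default_ends = [m.start() for m in re.finditer(r"$", text)]
multiline_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

print(default_ends)    # [3, 4]
print(multiline_ends)  # [1, 3, 4]
```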