Improving regex to match each line in a file, including windows and/or linux line break, even for missing line break at EOF

Question

My requirement is to match each line of a text file, including the line terminator of each, at most excluding the terminator of the last line, to take into account the crippled, non POSIX-compiant files generated on Windows; each line terminator can be either \n or \r\n.

And I'm looking for the best regex, performance-wise.

The first regex I could come up with is this:

\n|\r\n|[^\r\n]++(\r\n|\n)?

A few comments on it:

since three alternatives cannot match at the same place, I guess the order of the alternatives is irrelevant, regardless of the engine being a DFA or NFA;
the ++ instead of + should save some memory, but not some time, as backtracking shouldn't occur.

From Code Review, a suggestion was to use .*(\r?\n|$) (or [^\r\n]*(\r?\n|$), if . also matches \n o \r), but this has a flaw: it also matches the empty string at the end of the file.

That suggested regex can be improved like this:

(?=.).*(\r?\n)?

where the lookahead guarantees that there's at least one character matched by .* and (\r?\n)? together, which prevents the emtpy string at the end of the file from matching.

Which of the two regexes above should be better, performance-wise? Is there any other better way to match as per my requirements?

Please, if you use the ^/$ anchors or similar, comment about that, because their behavior is dependent on whether the engine considers them as multiline by default.

Are you using PCRE regex library? If yes, you'd better use `(?m)^.*$\R?` — Wiktor Stribiżew, Sep 18 '20 at 17:55
Woow, this seems nice! Yes, I think the regex library that I'm using is PCRE. I'll check tomorrow. However I'm pretty sure I verified that `^` ad `$` are `m`ultiline by default, which is suspicious. — Enlico, Sep 18 '20 at 18:11
@WiktorStribiżew yes, on `std::string`s I'm using `boost::regex`. — Enlico, Sep 18 '20 at 20:09
I suspect that `boost::regex expression(R"((?-s)^(?!\z).*$\R?)");` will be best. — Wiktor Stribiżew, Sep 18 '20 at 20:14

score 1 · Accepted Answer · answered Sep 19 '20 at 13:31

1

The best performance in regex is achieved when each subsequent pattern cannot match at the same locationin the string. . and \R are opposite patterns, . is used to match any char but line break chars, and \R is used to match any line break sequence.

In context of C++ Boost regex, where a . matches any char including line break chars and ^ and $ anchors are line (not string) "terminators", the pattern you may consider using is

(?-s)^(?!\z).*\R?

See the regex demo. Details:

(?-s) - turning singleline mode off, the . will now fail to match line break chars
^ - start of a line (boost::regex syntax does not require (?m) to make ^ line-aware, it is the default behavior)
(?!\z) - a negative lookahead that fails the match if the current position is at the very end of string
.* - any zero or more chars other than line break chars, as many as possible (this pattern moves the regex index right at the end of the line)
\R? - an optional line break sequence.

Here is a C++ boost::regex demo:

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  std::string text = "Line1\nLine2\r\nLine3\rLastLine\n";
  boost::regex expression(R"((?-s)^(?!\z).*\R?)");
  boost::smatch match;
  boost::sregex_token_iterator iter(text.begin(), text.end(), expression, 0);
  boost::sregex_token_iterator end;
  for( ; iter != end; ++iter ) {
   std::cout << "'" << *iter << "'" << std::endl;
  }
  return 0;
}

Output:

'Line1
'
'Line2
'
'Line3
'
'LastLine
'

answered Sep 19 '20 at 13:31

Wiktor Stribiżew

607,720
39
448
563

Would `.*+`instead of `.*`improve the regex' space and/or time performance? – Enlico Sep 19 '20 at 13:42
@EnricoMariaDeAngelis As I explained in the beginning, it should not be necessary as `.` and `\R` are opposite patterns and thus cannot match the same text at the same position. You surely can use `.*+`, it should not affect the peformance in theory in this case. – Wiktor Stribiżew Sep 19 '20 at 13:48
are you saying that even without the plus, the regex'engine will not save states for that piece of match? – Enlico Sep 19 '20 at 13:57
@EnricoMariaDeAngelis You may see [how the regex works here](https://regex101.com/r/ke822x/2/debugger), you can see there is no backtracking. – Wiktor Stribiżew Sep 19 '20 at 13:59
Well, to be honest, since [`ab?c` against _abc_](https://regex101.com/r/qgSVzG/2) and [`ab??c` against _abc_](https://regex101.com/r/qgSVzG/3) both show the same result of 1 match in 4 steps, and also the same steps in the debugger, it seems to me that regex101's debugging tool doesn't really look deeply into the regex engine. In fact, in the latter case the engine has to backtrack once, because after matching `a` against _a_, it first tries to skip the lazy `b??`. Therefore, I'd say that regex101 is not telling me whether `.*+` is or isn't a better choice than `.*`, nor why. – Enlico Sep 20 '20 at 18:50
In making the hypothesis that `.*+` saves some memory with respect to `.*`, I'm trying to put in practice what I'm learning from [Mastering Regular Expressions](https://www.oreilly.com/library/view/mastering-regular-expressions/0596528124/). – Enlico Sep 20 '20 at 18:51
@EnricoMariaDeAngelis The lazy and greedy quantifiers behave differently, but possessive and greedy in the current case behave the same. You can't compare these two scenarios. – Wiktor Stribiżew Sep 20 '20 at 19:34
My point was that regex101, in terms of backtracking, doesn't _even_ see the difference between lazy and greedy, which is maybe more obvious; so why should I consider it when comparing possessive and non possessive quantifiers? In other words, I do see that using `.*` or `.*+` results in the same match, but I don't see anything about which states one or the other regex saved. – Enlico Sep 20 '20 at 19:39
Because I explained to you why and how it matches. regex101 is just to show how matching is done (not a proof). – Wiktor Stribiżew Sep 20 '20 at 19:41
In this case, I haven't got the explanation. With reference to `.*`, for instance, you write _this pattern moves the regex index right at the end of the line_. This means that you are not considering that it actually has to "stop" and save a state at every character it eats. And that's the states I'm referring to, which `.*+` wouldn't save. – Enlico Sep 20 '20 at 19:47
If your point is instead that the engine is so smart that it knows that in this case `.*` will not have to give up, and so it doesn't save the states along the way, then I have undestood it only now. But still I wouldn't know where I could read that boost implementation does this smart thing. – Enlico Sep 20 '20 at 19:49
@EnricoMariaDeAngelis You must be missing the understanding what greedy means in regex. Please read [this answer](https://stackoverflow.com/questions/33869557/can-i-improve-performance-of-this-regular-expression-further/33869801#33869801) of mine. You can't expect "*it actually has to "stop" and save a state at every character it eats*". It just doesn't. `.*\R?` involves no backtracking because it is analogous to `a*b?`. `a` does not match `b` and `b` does not match `a`. No backtracking possible. If it were `za*b?1`, yes, there could be backtracking. – Wiktor Stribiżew Sep 20 '20 at 19:50
Quoting from there: _`A*` Zero or more `A`s, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)_. How can it give back character by character (in case the engine needs it), if it hasn't saved anything while matching all the characters to the end of the line? – Enlico Sep 20 '20 at 19:54
I'm not saying `.*` backtracks in this example. I'm saying it has to be ready to backtrack, whereas `.*+` has not, so the former has to save something along the way, the latter hasn't. This is what I think and understand (even from your answer you linked). – Enlico Sep 20 '20 at 19:57
Both `.*` and `.*+` "save" the matched string into the match memory buffer in the same way. – Wiktor Stribiżew Sep 20 '20 at 19:59
Ok, maybe my English is not enough this evening, as it's not my native language. I'll probably ask a separate, more targeted question which links to this. Unrelated to this, tomorrow I'll try the boost solution you propose, which seems unquestionably better than the one I showed in my question. – Enlico Sep 20 '20 at 20:01
Backtracking needs no "saved positions", all it does is reverses the iterator if needed. In the expression I posted, this just does not happen. – Wiktor Stribiżew Sep 20 '20 at 20:07
Wiktor, where do I find the metasequences as you did? E.g., where do I find `\R` or `\z` documented for boost? – Enlico Sep 21 '20 at 16:01
1

See [Boost regex reference](https://www.boost.org/doc/libs/1_74_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html). – Wiktor Stribiżew Sep 21 '20 at 16:05

Improving regex to match each line in a file, including windows and/or linux line break, even for missing line break at EOF

1 Answers1