
Consider the following program:

#include <iostream>
#include <regex>

int main(int argc, char* argv[]) {
  if (argc==4)
    std::cout << std::regex_replace(
        argv[1], std::regex(argv[2]), argv[3]
      ) << std::endl;
}

Running

./a.out a_a_a '[^_]+$' b

gives the expected result a_a_b. But running

./a.out a_a_a '[^_]*$' b

prints a_a_bb.

boost::regex_replace has the same behavior.

I don't understand why the empty string after the last a gets matched again, when I've already consumed $.

SU3
  • I think it is because of `*`. Since it matches zero or more characters, it first matches the final `a` and puts a **b**, then matches the empty string after it and puts the second **b**. Make sure you are using gcc 7.1.0 or clang 3.0 or higher – Shakiba Moshiri Sep 08 '17 at 06:33
  • It is a known behavior for a lot of regex flavors. If you want to match *something*, make sure your pattern does not match an empty string. Or at least anchor it at both start and end. – Wiktor Stribiżew Sep 08 '17 at 06:34
  • Can someone explain the rationale for this design? I fail to see how `$` serves its role if there's still something left to match after it's consumed. – SU3 Sep 08 '17 at 06:39

3 Answers


It is simply the difference between the * quantifier and the + quantifier. With *, the pattern matches the final letter a and also a zero-width string at the end.

You can see it here:

[^_]*$

Not only does it match the last a, it also matches the zero-width position after it, so the result is a_a_bb.


To convince yourself that it works this way, try:

[^_]*

and if you feed the program a_a_a the output would be:

bb_bb_bb



Note that the pattern [^_] on its own matches the three a characters. As soon as you put an asterisk * after it, the pattern matches either a run of non-underscore characters or nothing (zero-width). So [^_]* against the subject a_a_a matches at 6 points: each a, and the empty position right after each a (before each _ and at the end of the string).

a_a_a
^^^^^^
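The six match points can be listed directly. This is a minimal sketch, assuming std::sregex_iterator reports the same sequence of matches that std::regex_replace walks through; list_matches is a hypothetical helper name, not part of the original program.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (position, length) for every match that
// std::sregex_iterator reports for pattern in subject.
std::vector<std::pair<std::size_t, std::size_t>>
list_matches(const std::string& subject, const std::string& pattern) {
  const std::regex re(pattern);
  std::vector<std::pair<std::size_t, std::size_t>> out;
  for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
       it != end; ++it) {
    // position() and length() refer to the whole match (sub-match 0).
    out.emplace_back(static_cast<std::size_t>(it->position()),
                     static_cast<std::size_t>(it->length()));
  }
  return out;
}
```

For the subject a_a_a and the pattern [^_]*, this should yield six entries whose lengths alternate between 1 (an a) and 0 (the empty position after it). With [^_]*$ it should yield two: the last a and the empty string at the end.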
Community
Shakiba Moshiri
  • My misunderstanding was that I thought that for `.*` to match *nothing*, that *nothing* had to be in a place that's not immediately after *something* this same pattern had just matched. That's how `sed` behaves. – SU3 Sep 08 '17 at 07:11
  • `sed 's/[^_]*$/b/' <<< a_a_a` prints `a_a_b`. – SU3 Sep 08 '17 at 07:17
  • There are many regex flavors, and `sed` supports a basic one, not a rich, flexible one. [See my answer to this question](https://stackoverflow.com/questions/46087665/std-regex-search-to-match-only-current-line/46098368#46098368) about the C++ regex library – Shakiba Moshiri Sep 08 '17 at 07:20
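The sed command quoted in the comments has no g flag, so it replaces only the first match. A rough C++ counterpart, as a sketch only, is to pass std::regex_constants::format_first_only to std::regex_replace. The helper name replace_first is hypothetical, and this does not claim that sed and std::regex treat empty matches the same way in general.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces only the first match of pattern in subject,
// similar in spirit to sed's s/// without the g flag.
std::string replace_first(const std::string& subject,
                          const std::string& pattern,
                          const std::string& replacement) {
  return std::regex_replace(subject, std::regex(pattern), replacement,
                            std::regex_constants::format_first_only);
}
```

With the subject a_a_a, the pattern [^_]*$ and the replacement b, the first match is the final a, so the trailing empty match is never replaced.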

Anchors don't get consumed (since they are 0-width).

You could try making the pattern abc$$$ to match the string abc and it will still match, as would the pattern ^^^abc. Thus, the $ in your pattern doesn't get consumed, which allows both a$ and (empty)$ to match.
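A small sketch of the zero-width claim, assuming the default ECMAScript grammar of std::regex accepts repeated anchors as described above; first_match_length is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the length of the first match of pattern in
// subject, or -1 when there is no match.
long first_match_length(const std::string& subject,
                        const std::string& pattern) {
  std::smatch m;
  if (!std::regex_search(subject, m, std::regex(pattern))) return -1;
  return static_cast<long>(m.length());
}
```

Under that assumption, abc$$$ and ^^^abc should both match abc with length 3, and a lone $ should match with length 0, because the anchor asserts a position rather than consuming characters.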

Jerry

I think it is because:

+ means one or more (at least one occurrence is needed for the match to succeed)
* means zero or more (the match succeeds even when the searched-for characters are absent)

So [^_]+$ matches only the final a, while [^_]*$ matches the final a and then the empty string after it, which produces the double b.
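The two quantifiers can be compared side by side. This is a minimal sketch that wraps the same std::regex_replace call used in the question; the helper name replace_all is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of pattern in subject, which is
// the default behavior of std::regex_replace.
std::string replace_all(const std::string& subject,
                        const std::string& pattern,
                        const std::string& replacement) {
  return std::regex_replace(subject, std::regex(pattern), replacement);
}
```

For the subject a_a_a and the replacement b, the pattern [^_]+$ gives a_a_b and [^_]*$ gives a_a_bb, as reported in the question.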

GAVD