
Here is the code:

#include <string>
#include <regex>
#include <iostream>

int main()
{
    std::string pattern("[^c]ei");
    pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
    std::regex r(pattern); 
    std::smatch results;   
    std::string test_str = "cei";

    if (std::regex_search(test_str, results, r)) 
        std::cout << results.str() << std::endl;      

    return 0;
}

Output:

cei

The compiler used is gcc 4.9.1.

I'm a newbie learning regular expressions. I expected no output, since "cei" shouldn't match this pattern. Am I doing it right? What's the problem?

Update:

This has been reported and confirmed as a bug. For details, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497

aliteralmind
Yue Wang
  • @Vajura: `[[:alpha:]]` should be a correct character class. It is mentioned in the C++ reference as an extension to ECMA script. – nhahtdh Oct 09 '14 at 07:48
  • Aside from problems with your code, Regex support in current gcc is very limited. Imho, it's not worth the trouble. Use Boost's Regex and be happy. https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2014 – rubber boots Oct 09 '14 at 07:48
  • Removing the first `[[:alpha:]]*` gives the expected result, I'm not sure what the conflict is. – Niall Oct 09 '14 at 07:48
  • Just some info: on gcc 4.8.x it throws `std::regex_error` at line `std::regex r(pattern);` – Mine Oct 09 '14 at 07:50
  • clang++ works (libc++), gcc fails (libstdc++) ... http://coliru.stacked-crooked.com/a/fc56ed8c533bda55 – Niall Oct 09 '14 at 07:51
  • thx @rubberboots I'm gonna try `boost` instead. – Yue Wang Oct 09 '14 at 07:52
  • @Alan.W `[[:alpha:]]*` will eat up all characters. Then there's no `cei` left for the matcher because it's already at the end. With backtracking, the string should match .... – rubber boots Oct 09 '14 at 07:52
  • :S I thought it was an invalid/bad regex.. https://www.myregextester.com/index.php it doesn't work on there.. using `[A-Za-z]*[^c]ei[A-Za-z]*` even on http://regex101.com/ and regexpal, it doesn't work.. so.. Even using just `[^c]ei` it fails. – Brandon Oct 09 '14 at 07:57
  • @Niall I tried Clang++ 3.4, and it complained like crazy. Should I use 3.5 instead? – Yue Wang Oct 09 '14 at 07:58
  • @Alan.W, yes, I believe the coliru clang is 3.5.0. – Niall Oct 09 '14 at 07:59
  • @rubberboots: If you read the tables carefully on the link you posted, GCC claims to have full C++11 regex support. – John Zwinck Oct 09 '14 at 08:02
  • This seems to be a bug related to the fixed string optimization, where the engine searches for the fixed string first before evaluating the rest of the expression. – nhahtdh Oct 09 '14 at 08:06
  • The correct pattern is: `(?!c)ei` aka negative lookahead.. I don't understand how all of you guys are deeming this a bug.. His pattern doesn't work in Java. It doesn't work anywhere even if you replace the `[[:alpha:]]` with `[A-Za-z]` http://ideone.com/ksbtAq – Brandon Oct 09 '14 at 08:07
  • @Brandon: Your pattern will always match `ei`. The pattern in the question is a valid pattern according to C++ standard, and the correct behavior is that it won't match `cei`. – nhahtdh Oct 09 '14 at 08:12
  • @JohnZwinck correct, they changed that feature list compared to what I read some time ago. Maybe I'll give it another trial in depth ... Thanks! – rubber boots Oct 09 '14 at 09:04
  • @Alan.W IMHO learning regex with the C++ libraries is too hard and too obfuscating due to the C++-isms in style and expression. I'd advise you to use Perl for learning regular expressions and to gradually transfer working expressions into C++ later. See this book: http://regex.info/ – rubber boots Oct 09 '14 at 09:11
  • @rubberboots Thx man! I do hear that `perl` is pretty good for using regex, but I have no experience on `perl`. Do you think it's still a better choice even without any experience on `perl`? I was told that `perl` is a quite strange language.. – Yue Wang Oct 09 '14 at 09:24
  • @Alan.W - TO Perl OR NOT TO Perl? This is clearly a question of your capabilities and learning time resources. In order to find out, you could spend one evening at this: http://learn.perl.org/ – rubber boots Oct 09 '14 at 09:29
  • @rubberboots Awesome! I'll try it. `Perl` sounds pretty cool anyway. Thx again. – Yue Wang Oct 09 '14 at 09:37
  • @Alan.W http://perldoc.perl.org/perlretut.html – rubber boots Oct 09 '14 at 09:44

2 Answers


It's a bug in the implementation. Not only do a couple other tools I tried agree that your pattern does not match your input, but I tried this:

#include <string>
#include <regex>
#include <iostream>

int main()
{
  std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
  std::regex r(pattern);
  std::smatch results;
  std::string test_str = "cei";

  if (std::regex_search(test_str, results, r))
  {
    std::cout << results.str() << std::endl;

    for (size_t i = 0; i < results.size(); ++i) {
      std::ssub_match sub_match = results[i];
      std::string sub_match_str = sub_match.str();
      std::cout << i << ": " << sub_match_str << '\n';
    }
  }
}

This is essentially what you had, but I replaced [[:alpha:]] with [a-z] for simplicity, wrapped each piece in a capture group, and temporarily replaced [^c] with [a-z], because that seems to make it work correctly. Here's what it prints (GCC 4.9.0 on Linux x86-64):

cei
0: cei
1:
2: c
3: e
4: i
5:

If I put a literal f in place of that [a-z] (where you had [^c]), it correctly reports that the pattern doesn't match. But if I use [^c] like you did:

std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");

Then I get this output:

cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Aborted (core dumped)

So it claims to match successfully, and results[0] is "cei" which is expected. Then, results[1] is "cei" also, which I guess might be OK. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. And that giant number is exactly 2^64 - 2, aka std::string::npos - 1 (on my system).

So I think there is an off-by-one error somewhere, and the impact can be much more than just a spurious regex match--it can crash at runtime.
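As a quick sanity check of that arithmetic, here is a minimal sketch. It assumes a 64-bit std::size_t (as on the x86-64 Linux system above), and the names kReportedLength, is_two_pow_64_minus_two, is_npos_minus_one and check_reported_length are made up for illustration. It only confirms the numbers; it says nothing about where inside libstdc++ the bad length comes from.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Length reported in the crash above. Hypothetical constant name.
constexpr std::uint64_t kReportedLength = 18446744073709551614ULL;

// 2^64 - 2 is one less than the largest 64-bit unsigned value (2^64 - 1).
constexpr bool is_two_pow_64_minus_two(std::uint64_t n)
{
    return n == std::numeric_limits<std::uint64_t>::max() - 1;
}

// ASSUMPTION: std::size_t is 64 bits wide, so npos == 2^64 - 1.
// On a 32-bit target this comparison would be false.
constexpr bool is_npos_minus_one(std::uint64_t n)
{
    return n == static_cast<std::uint64_t>(std::string::npos) - 1;
}

// Small self-check tying the two views of the number together.
inline void check_reported_length()
{
    assert(is_two_pow_64_minus_two(kReportedLength));
    assert(is_npos_minus_one(kReportedLength));
}
```

Calling check_reported_length() from main on a 64-bit build should pass silently.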

John Zwinck
  • Since you are mentioning a crash `".{0,3}[^\\s\\S]ei.?"` – nhahtdh Oct 09 '14 at 08:02
  • Thx man! I got exactly the same thing using your code. I guess I need `Clang` or to use `boost` instead. – Yue Wang Oct 09 '14 at 08:16
  • @Alan.W: or just formulate your regex a bit differently. Several expressions I tried which were not too much different than this worked as expected. Do you want to file a GCC bug for this? I think you should. – John Zwinck Oct 09 '14 at 08:24
  • Cool I'll try different pattern. Bug reporting? Can I? How to report such thing? Can you share a link or something that I can learn how to report this bug? – Yue Wang Oct 09 '14 at 08:29
  • @Alan.W: here you go: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc - you will need to create an account. It's easy enough, don't be scared. :) – John Zwinck Oct 09 '14 at 08:35
  • I found `Bug 61720` is quite similar : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61720. Can you have a look? Should I still report this bug? – Yue Wang Oct 09 '14 at 09:00
  • @Alan.W: I think you should report it. Worst case one fix will correct both bugs; your expressions are a bit different from the other one, plus we have demonstrated that it results in an outright runtime crash which should not be possible as far as I can tell. Please do file a bug and link it in a comment here. – John Zwinck Oct 09 '14 at 10:59
  • Done, it's Bug 63497: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497 Is it ok? If anything I wrote is not appropriate, please let me know. – Yue Wang Oct 09 '14 at 12:05
  • @Alan.W: looks good to me (not that my opinion matters in this case!). Thank you. – John Zwinck Oct 09 '14 at 12:06

The regex is correct and should not match the string "cei".

The regex can be tested and explained best in Perl:

 my $regex = qr{                 # start regular expression
                 [[:alpha:]]*    # 0 or any number of alpha chars
                 [^c]            # followed by NOT-c character
                 ei              # followed by e and i characters
                 [[:alpha:]]*    # followed by 0 or any number of alpha chars    
               }x;               # end + declare 'x' mode (ignore whitespace)

 print "xei" =~ /$regex/ ? "match\n" : "no match\n";
 print "cei" =~ /$regex/ ? "match\n" : "no match\n";

The regex will first consume all chars to the end of the string ([[:alpha:]]*), then backtrack to find the NON-c char [^c] and proceed with the e and i matches (by backtracking another time).

Result:

 "xei"  -->  match
 "cei"  -->  no match

for obvious reasons. Any deviation from this in C++ libraries or testing tools is a problem in that implementation, imho.
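The greedy-then-backtrack behaviour described above can be sketched by hand in C++. This is only an illustration under stated assumptions: search_by_hand is a hypothetical helper written for this one pattern, not how std::regex is implemented, and search_with_std_regex assumes a standard library without the bug discussed here (libc++, or a libstdc++ release that includes the fix).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hand-rolled search for [[:alpha:]]*[^c]ei[[:alpha:]]* (hypothetical helper).
bool search_by_hand(const std::string& s)
{
    for (std::size_t start = 0; start < s.size(); ++start) {
        // Greedy step: the leading [[:alpha:]]* takes as many letters as it can.
        std::size_t run = start;
        while (run < s.size() && std::isalpha(static_cast<unsigned char>(s[run])))
            ++run;

        // Backtracking step: give letters back one at a time until
        // [^c], 'e' and 'i' fit right after the prefix, or nothing is left.
        for (std::size_t k = run; ; --k) {
            if (k + 2 < s.size() && s[k] != 'c' && s[k + 1] == 'e' && s[k + 2] == 'i')
                return true;  // the trailing [[:alpha:]]* can always match (possibly empty)
            if (k == start)
                break;
        }
    }
    return false;
}

// The same search through std::regex, for comparison.
bool search_with_std_regex(const std::string& s)
{
    static const std::regex r("[[:alpha:]]*[^c]ei[[:alpha:]]*");
    return std::regex_search(s, r);
}

// Small self-check mirroring the Perl results above.
inline void check_examples()
{
    assert(search_by_hand("xei"));   // 'x' is not 'c', so [^c]ei fits
    assert(!search_by_hand("cei"));  // the only "ei" is preceded by 'c'
}
```

On a conforming library, search_with_std_regex should agree with search_by_hand for "xei" and "cei"; GCC 4.9.1, as shown in the question, reports a match for "cei".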

rubber boots