8

In http://llvm.org/svn/llvm-project/libcxx/trunk/test/re/re.alg/re.alg.match/ecma.pass.cpp, the following test exists:

    std::cmatch m;
    const char s[] = "tournament";
    assert(!std::regex_match(s, m, std::regex("tour|to|tournament")));
    assert(m.size() == 0);

Why should this match be failed?

On VC++2012 and boost, the match succeeds.
On Javascript of Chrome and Firefox, "tournament".match(/^(?:tour|to|tournament)$/) succeeds.

Only on libc++, the match fails.

ganaware
  • 163
  • 1
  • 4
  • In my tests. `c` matches `(a|b)|c` and `a|(b|c)`, but does not match `a|b|c`. `a` and `b` match all three. In general, if more than two expressions are chained with `|`, only the first two seem to work. I think it's a bug but they call this test "ecma.pass.cpp" for some reason, so I'm not sure. – n. m. could be an AI Jul 12 '13 at 08:07
  • [Here](http://stackoverflow.com/questions/9764264/strange-bug-stdregex-matches-only-first-two-strings) is another report of the same. – n. m. could be an AI Jul 12 '13 at 08:14
  • 1
    Hmm. I think this requires a bug report. Even if it’s actually correct, the test should document *why* it’s correct. And I doubt that it is. – Konrad Rudolph Jul 12 '13 at 11:47

1 Answers1

5

I believe the test is correct. It is instructive to search for "tournament" in all of the libc++ tests under re.alg, and compare how the different engines treat the regex("tour|to|tournament"), and how regex_search differs from regex_match.

Let's start with regex_search:

awk, egrep, extended:

regex_search("tournament", m, regex("tour|to|tournament"))

matches the entire input string: "tournament".

ECMAScript:

regex_search("tournament", m, regex("tour|to|tournament"))

matches only part of the input string: "tour".

grep, basic:

regex_search("tournament", m, regex("tour|to|tournament"))

Doesn't match at all. The '|' character is not special.

awk, egrep and extended will match as much as they can with alternation. However the ECMAScript alternation is "ordered". This is specified in ECMA-262. Once ECMAScript matches a branch in the alternation, it quits searching. The standard includes this example:

/a|ab/.exec("abc")

returns the result "a" and not "ab".

<plug>

This is also discussed in depth in Mastering Regular Expressions by Jeffrey E.F. Friedl. I couldn't have implemented <regex> without this book. And I will freely admit that there is still much more that I don't know about regular expressions, than what I know.

At the end of the chapter on alternation the author states:

If you understood everything in this chapter the first time you read it, you probably didn't read it in the first place.

Believe it!

</plug>

Anyway, ECMAScript matches only "tour". The regex_match algorithm returns success only if the entire input string is matched. Since only the first 4 characters of the input string are matched, then unlike awk, egrep and extended, ECMAScript returns false with a zero-sized cmatch.

Howard Hinnant
  • 206,506
  • 52
  • 449
  • 577
  • Your explanation might be true, but the whole purpose of a regular expression and specificly of an alternation would be questionable. If that's like you think, what is gained from such behavior? – rubber boots Jul 12 '13 at 16:14
  • I'm the wrong person to ask. I am not a regex expert, nor did I participate in the design of any regex engine, or even in the C++ standardization of `std::regex`. I just implemented `std::regex` after it was standardized. – Howard Hinnant Jul 12 '13 at 16:21
  • A `regex_match` **should** behave exactly like `regex_search` with the further constraint to match **the whole string given** (and not only a part of it, like `regex_search`). The *non-match* in the above example contradics every convention and every expectation (I did read every Friedl book and use regular expressions on a daily base). – rubber boots Jul 12 '13 at 16:31
  • Is your comment saying something different than the last paragraph in my answer? – Howard Hinnant Jul 12 '13 at 16:41
  • 3
    I'm not convinced that `regex_match` should really behave as you describe, Howard. The description in the standard is notoriously unclear. You describe it in terms of "match the regular expression and then check that the match spans the whole input". But it could also mean "match the regular expresssion as if enclosed in ^..$". I really think the wording leaves much to be desired here. – Sebastian Redl Jul 12 '13 at 16:44
  • @HowardHinnant - You said in your last paragraph: "*ECMAScript returns false with a zero-sized cmatch*". I answered in the sense of "*ECMAScript returns **a false** false with a zero-sized cmatch*" – rubber boots Jul 12 '13 at 16:50
  • 1
    @SebastianRedl: I freely admit to struggling with the specification during implementation. A lot of that struggle was because of the massive amount of referencing other standards. On this particular question, [re.alg.match]/p2 seems unambiguous to me: Determines whether there is a match between the regular expression e, and all of the character sequence [first,last). – Howard Hinnant Jul 12 '13 at 16:55
  • @rubberboots: Thanks for the clarification. I have no comment or opinion concerning the quality of the ECMAScript specification. I simply tried to faithfully follow it. – Howard Hinnant Jul 12 '13 at 16:56
  • Please explain why in the `std::regex` implementation `a` matches `a|b|c` and `b|a|c`, but does not match `c|b|a`. Does this result agree with your interpretation of the ECMAScript specifications? – n. m. could be an AI Jul 12 '13 at 22:39
  • @n.m.: That sounds like a bug. But I'm not reproducing it with tip-of-trunk. Bill Fisher has been fixing regex bugs in libc++ over the past two weeks. He may have fixed this one. – Howard Hinnant Jul 12 '13 at 23:12
  • So if `a|b|c` is a bug, how come `tour|to|tournament` isn't? It's got all the same symptoms. The first two alternatives can be matched, the third and all the subsequent ones doesn't seem to match anything. – n. m. could be an AI Jul 13 '13 at 00:05
  • @n.m.: Sorry, I don't understand the question. In your earlier post you said `c|b|a` didn't match, whereas `a|b|c` did. Now you're saying `a|b|c` "is a bug", and comparing it to `tour|to|tournament`. I told you that `c|b|a` ought to match the input string `a` and if it doesn't, that's a bug. I'm lost. My understanding of ECMAScript is that an input string will match the first alternation it can. This differs from awk, egrep and extended, which match the longest alternation they can (as I understand it). – Howard Hinnant Jul 13 '13 at 00:26
  • I don't understand what is not clear here. An input string `X` should match the regular expression `X|Y|Z`, `Y|X|X`, and `Y|Z|X` no matter what `X` `Y` and `Z` are (having no special characters of course). Do you agree? If so why are you saying that the test is correct when `X` `Y` `Z` are equal to `tour` `to` `tournament` respectively? – n. m. could be an AI Jul 13 '13 at 05:52
  • @HowardHinnant: Thank you. I understand why the match fails. I think it is a bit confusing, but the C++11 specification requires the effect. – ganaware Jul 13 '13 at 10:39
  • @n.m.: My understanding of ECMAScript is that an input string will match the first alternation it can. This differs from awk, egrep and extended, which match the longest alternation they can (as I understand it). – Howard Hinnant Jul 13 '13 at 13:35
  • So if it cannot match the first alternative, will it try to match the second one? the third one? If yes, why the second one is matched and the third one is not? If not, why these other alternatives are useful? What you think is broken, the specification or your understanding of it? – n. m. could be an AI Jul 13 '13 at 14:05
  • I think there is some misunderstanding here. You keep talking about how matching the first good alternative differs from matching the longest good one. In this test there is only one good alternative. There is no question of which alternative should be matched (the good one should, there is only one). The question is why the entire expression is not matched when there is a good alternative. – n. m. could be an AI Jul 13 '13 at 14:32
  • Howard, is the libc++ regex implementation yours (the answer seems to imply this)? If so, that’s scary. Now, don’t get me wrong, I have the greatest respect for you and your ability, and looking at the test cases you’ve probably done an admirable test, both in terms of correctness, and probably also performance. But, damn, modern regexes are *hard* (as you’ve noticed yourself), especially for performance. Reinventing the wheel instead of using an established library seems insane, quite frankly. It sounds like an incredibly bad idea. Isn’t it impossible to adapt PCRE etc? – Konrad Rudolph Jul 13 '13 at 18:28
  • @n.m.: See the Get|GetValue|Set|SetValue example here: http://www.regular-expressions.info/alternation.html – Howard Hinnant Jul 13 '13 at 21:36
  • @KonradRudolph: To the best of my knowledge, PCRE will not satisfy the C++11 specification, which requires 6 separate regex engines. – Howard Hinnant Jul 13 '13 at 21:37
  • @Howard Of course, PCRE was a brain fart. And I agree that one single library won’t do – but wouldn’t the most sensible course of action be to use one library per supported syntax, and write an adaptor for each? Incidentally, we discussed this in the Lounge chat just after I posted the comment, without reaching any real result. – Konrad Rudolph Jul 13 '13 at 21:43
  • 2
    @SebastianRedl: I now see the ambiguity you described. It would not be a bad idea to submit an issue to resolve it. http://cplusplus.github.io/LWG/lwg-active.html#submit_issue – Howard Hinnant Jul 13 '13 at 21:50
  • OK I now see that ECMAscript specification does not use the word "match" in its normal sense. By "match" they seem mean "match an initial non-zero-length portion of the string", or perhaps something else, I'm not sure. – n. m. could be an AI Jul 14 '13 at 03:46
  • Or rather their "match" really means "search" (match any portion of the string, not just the initial one). – n. m. could be an AI Jul 14 '13 at 04:02
  • So the question is, should we interpret "match entire sequence" as "find the first matching subsequence and then see if it coincides with the entire sequence". IMHO such interpretation has basis neither in the standard nor in the common usage of the word "match" as we know it from the theory of regular expressions. The interpretation of "search as if the entire regex is surrounded by `^$` is much much more reasonable. – n. m. could be an AI Jul 14 '13 at 04:55
  • To add insult to injustice, my tests were made with libg++ not libc++ (yes I'm slow and thick). – n. m. could be an AI Jul 14 '13 at 05:19
  • 2
    I now believe that your interpretation of "match" does not conform to the standard and the other one does. This space is to small to contain an argument however. – n. m. could be an AI Jul 14 '13 at 06:31
  • I also believe that the statement "C+11 specification [...] requires 6 separate regex engines" is incorrect. There is only one engine. The standard talks about "the regular expression engine" anyway. There are six different grammars to translate string representations of regexes to the internal format, but there is only one internal format. The matching algorithm flags are a totally separate set, not related to the grammar that was used to translate the regex . – n. m. could be an AI Jul 14 '13 at 08:46
  • This is tragic. For regex_match(/pattern/) to act differently from regex_match(/^pattern$/) is clearly wrong. – Jim Balter Nov 09 '14 at 11:12