I am currently working on a program that matches certain URLs using regexes. For example, I have this case:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    try
    {
        std::wstring url = L"www.google.com";
        std::wregex reg(L"^(?:(?!math|latex).)*\\.?stackexchange.com");
        std::regex_search(url, reg);
    }
    catch (const std::regex_error& e) // catch by const reference rather than by value
    {
        std::cout << e.what() << std::endl;
        std::cout << e.code() << std::endl;
    }
}

This works like a charm; in this case it obviously does not match. But if I use any URL string longer than 497 characters, the regex_search function fails. Example:

std::wstring url = L"AAAAA...AAAA"; //498 chars long

e.what() prints: "regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence."

e.code() prints: 12

It surprises me that the stack size seems to be the problem. After all, 498 chars aren't that much if you think about searching a file or the source code of a website using regexes. Is there any way to fix this in a solid manner, other than delaying the problem to more (maybe 1000) chars? Maybe with a self-written allocator? Though I doubt that it's possible to provide any form of allocator that allocates stack frames on the heap. Increasing the stack size isn't a real solution because it just delays the problem.

I am using Visual Studio Community 2017, Version 15.5.6.

This shows that my assumption seems to be correct.

Exact exception: Unhandled exception at 0x76EECBB2 in RegTest.exe: Microsoft C++ exception: std::regex_error at memory location 0x00454FF8.

Brotcrunsher
  • That seems to be a bug in the implementation of `std::regex_search`, assuming your code doesn't do anything stupid. Contact the vector for a fix. – Ulrich Eckhardt Mar 07 '18 at 09:28
  • It must be due to the tempered greedy token, try `LR"(^(?!.*(?:latex|math).*stackexchange\.com)stackexchange\.com)"` – Wiktor Stribiżew Mar 07 '18 at 09:28
  • @UlrichEckhardt I made a little test case which only runs the code above, so it's certainly nothing "stupid" in the rest of my code. What do you mean by "Contact the vector for a fix"? Do you mean the dev team of Visual Studio? – Brotcrunsher Mar 07 '18 at 09:32
  • Yes, the vendor would be Microsoft, assuming you use their compiler and standard library. – Ulrich Eckhardt Mar 07 '18 at 09:34
  • *"Unfortunately, the Exception isn't really helpful."* - Maybe not to you, but if you keep it private, those who might make more of it won't have a chance to help you out. At any rate, an exception originating in KernelBase is a C exception (SEH), which is in the vast majority of cases a symptom of undefined behavior. Please show a [mcve]. Please also include the version of the compiler. – IInspectable Mar 07 '18 at 09:40
  • Also, you can try running the code with Release mode. In Debug mode, some additional work is done in the background that causes similar issues. – Wiktor Stribiżew Mar 07 '18 at 09:41
  • @IInspectable: In this case, it's most likely not Undefined Behavior. It's hitting an implementation limit in Visual Studio's implementation of `std::regex_search`. – MSalters Mar 07 '18 at 10:50
  • The addresses are meaningless (other than being able to determine, that you are running a 32-bit process). Please post the exception **code**. And if at all possible, do change your Visual Studio's language to English. That makes both discovering this question easier as well as allowing you to receive better search results yourself. @MSalters: Maybe, maybe not. Since we don't have a [mcve], all that's left is guessing and speculation. Hardly the modus operandi that Stack Overflow embraces. – IInspectable Mar 07 '18 at 11:03
  • If you `catch` the `std::regex_error` that is being thrown, what do its `what()` and `code()` functions say? – zenzelezz Mar 07 '18 at 11:06
  • @UlrichEckhardt _"Contact the [vendor] for a fix"_ Hahahahaha, good one ;) – Lightness Races in Orbit Mar 07 '18 at 11:17
  • _"After all 498 chars aren't that much if you think about searching a file or the source code of a website using regexes"_ But you wouldn't use regexes to search the source code of a website, [right](https://stackoverflow.com/a/1732454/560648)? – Lightness Races in Orbit Mar 07 '18 at 11:19
  • @zenzelezz: You cannot catch C exceptions with a C++ `catch`-clause. And since the exception originated in C code, is reported as unhandled, it stands to reason, that it wasn't converted to a C++ exception. – IInspectable Mar 07 '18 at 11:48
  • @IInspectable Where does it say it originated in C code? – zenzelezz Mar 07 '18 at 11:52
  • @IInspectable I provided additional information. Thanks for pointing that out! – Brotcrunsher Mar 07 '18 at 11:54
  • @zenzelezz: *"It says the exception occurred in KernelBase.dll!76eecbb2()."* KernelBase.dll is written in C. With the newly provided information, it appears, though, that the exception is indeed converted to a C++ exception. The call stack would help verify that. – IInspectable Mar 07 '18 at 11:58
  • There is no such thing as "C exceptions" guys. – Lightness Races in Orbit Mar 07 '18 at 12:09
  • @Lightness: The vendor of the software seems to [disagree with you](https://learn.microsoft.com/en-us/cpp/cpp/mixing-c-structured-and-cpp-exceptions). The term *"exception"* is far more generic, than a C++ context would have you believe. – IInspectable Mar 07 '18 at 12:23
  • The boost equivalent does not have the same issue. I just tested it with 7000 chars and it works perfectly well. – Brotcrunsher Mar 07 '18 at 13:43
  • What's so funny about contacting MS with bugreports, @LightnessRacesinOrbit? They do have a bugtracker, that's reasonably public (I think it requires registration) and if you paid them money, you are entitled to that support, at least under some jurisdictions. – Ulrich Eckhardt Mar 08 '18 at 08:51
  • @UlrichEckhardt: Have you ever observed such a report resulting in a fix? – Lightness Races in Orbit Mar 08 '18 at 10:46
  • @Lightness: I have. Many times, actually. Occasionally, you'll get a reply, that an issue will not get fixed, for one reason or another. But you still get a reply. Always. – IInspectable Mar 09 '18 at 18:08

1 Answer

The problem here is that the Standard does not specify an implementation of std::regex_search.

A good implementation will notice that the most significant part of your regex is its second half. "stackexchange.com" is a good pattern to match. Once found, it's easy to check what precedes it.

A simple implementation, though, will use your regex in the order that it's written. It will first match ^, which of course happens immediately on any input. It then gets a (?:(?!math|latex).)*, which is one of the worst patterns to match. It's not a formal regular expression but a negative lookahead, followed by a wildcard match ., and that's then repeated *.

As noted in the comments, regex matching is formally greedy. The naive regex engine will be ignoring the final stackexchange.com part and repeatedly try to match (?:(?!math|latex).)*, each time succeeding in that part only to fail on the stackexchange.com.

The fix is to be smart yourself, and do the string parsing in the order you know is best. First match .stackexchange.com by a straightforward string::find, not by regex. Only if this is the case, check the subdomain part by a regex_search, and use a simpler expression.
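A minimal sketch of that two-step approach, under some assumptions: the function name `matches_allowed_stackexchange` is hypothetical, the domain is treated as a literal string (the original pattern's unescaped `.` would match any character), and "the subdomain part" is taken to be everything before the first occurrence of the domain. Checking only the first occurrence is enough here, because any later occurrence has a longer prefix that contains the same text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper illustrating the idea: locate the domain with a plain
// string search first, then run a much simpler regex on the text before it.
bool matches_allowed_stackexchange(const std::wstring& url)
{
    static const std::wstring domain = L"stackexchange.com";
    // Simple alternation with no lookahead and no repetition.
    static const std::wregex blocked(L"math|latex");

    // Step 1: linear-time search for the literal domain.
    const std::wstring::size_type pos = url.find(domain);
    if (pos == std::wstring::npos)
        return false; // no domain at all, so no match

    // Step 2: reject the URL if the part before the domain mentions
    // one of the excluded names.
    const std::wstring prefix = url.substr(0, pos);
    return !std::regex_search(prefix, blocked);
}
```

With this sketch, `L"www.google.com"` and a long string of `A`s are rejected by the `find` call alone, so the regex engine never sees them.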

MSalters
  • This doesn't seem to address the problem, which is that a sizeable input crashes the call. – Lightness Races in Orbit Mar 07 '18 at 11:18
  • @lightness: it does; the crash is because the sizable string does not match. My solution detects the no-match without a regex, in this case in O(n) even. This scales to inputs of several gigabytes. – MSalters Mar 07 '18 at 11:34
  • Unfortunately, using your string::find approach is not an option because the regex can be set in a config file. The provided regex above is just an example. – Brotcrunsher Mar 07 '18 at 11:40
  • I agree that avoiding the regex entirely will work around a crash in the regex engine. Even just a single line confirming that there is a bug here (or giving some other reason for the SEH) would probably be enough to make the answer complete if it weren't for the stated restriction that the parsing logic be specified in a config file. – Lightness Races in Orbit Mar 07 '18 at 11:46
  • @Brotcrunsher: You may think of a regex as data, but with the extended syntax it becomes Turing-complete. And by extension, your config file becomes a script file. To make things worse, it's not realistically possible to estimate the complexity of a random regex. – MSalters Mar 08 '18 at 10:17