I'm trying to execute the following C++ STL-based code to replace text in a relatively large SQL script (~8MB):

std::basic_regex<TCHAR> reProc("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$\n((^(?![ \t]*go[ \t]*).*$\n)+)^[ \t]*go[ \t]*$");
std::basic_string<TCHAR> replace = _T("ALTER $1 $2\n$3\ngo");
return std::regex_replace(strInput, reProc, replace);

The result is a stack overflow, and it's hard to find information about that particular error on this particular site since that's also the name of the site.

Edit: I am using Visual Studio 2013 Update 5

Edit 2: The original file is over 23,000 lines. I cut the file down to 3,500 lines and still get the error. When I cut it by another ~50 lines down to 3,456 lines, the error goes away. If I put just those cut lines into the file, the error is still gone. This suggests that the error is not related to specific text, but just too much of it.

Edit 3: A full working example is demonstrated operating properly here: https://regex101.com/r/iD1zY6/1 It doesn't work in that STL code, though.

BlueMonkMN
  • do you know the `strInput` that triggers the stack overflow? – kmdreko May 18 '16 at 16:14
  • @vu1p3n0x Yes, but I'm not sure how to share such a large input string. I don't want to put 8 MB of text in the question. – BlueMonkMN May 18 '16 at 16:15
  • your regex is bounded by line (`"^...$"`) Is the file all one line? or is there a single line that triggers it? or is it only when processing the whole file at once what triggers it? – kmdreko May 18 '16 at 16:21
  • I added a link to a page on regex101 that includes the shortest version of the input string that causes the error. – BlueMonkMN May 18 '16 at 17:02
  • IMO you should replace this complicated regex by a loop over the lines - `std::find_if` for "create (view|proc)", `std::find_if` for "go", grab everything in-between and do your replacement this way. – Sebastian Redl May 18 '16 at 17:06
  • @SebastianRedl Unfortunately I'm very unfamiliar with STL. If you could demonstrate that in an answer, I'd love to try it. – BlueMonkMN May 18 '16 at 17:09
  • Is it just me, or do you just want to change `create` to `alter` for each procedure/view? – Thomas Ayoub May 18 '16 at 17:36
  • @ThomasAyoub I think that sums it up. – BlueMonkMN May 18 '16 at 17:39
  • It might be naive, but isn't [this](https://regex101.com/r/hF0uP8/1) an acceptable solution (even if it doesn't solve the SOE)? – Thomas Ayoub May 18 '16 at 17:47

2 Answers

The following trimmed-down version of your regex saves about 20% of processing steps according to regex101 (see here).

\\bcreate[ \t]+(view|procedure|proc)[ \t]+(.+)\n(((?![ \t]*go[ \t]*).*\n)+)[ \t]*go[ \t]*

Modifications:

  • inline anchors removed: you are expressly testing for newline characters
  • repetition operator for the db object keywords removed - a repetition at this point would make the original script syntactically invalid.
  • initial whitespace pattern replaced by word boundary (note the double backslash - the escape sequence is for the regex engine, not for the compiler)
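To illustrate the last point, here is a minimal sketch of how the three spellings of the pattern differ. It assumes plain `std::string`/`std::regex` rather than `TCHAR`, and `matches` is a hypothetical helper introduced only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text` (case-insensitive).
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern, std::regex::ECMAScript | std::regex::icase));
}

// The compiler turns "\b" into a single backspace character (0x08),
// so the regex engine looks for a literal backspace followed by "create".
const std::string backspace_pattern = "\bcreate";

// "\\b" reaches the regex engine as backslash + 'b', i.e. a word boundary.
const std::string escaped_pattern = "\\bcreate";

// A raw string literal avoids the doubling altogether.
const std::string raw_pattern = R"(\bcreate)";
```

With these definitions, `matches("create view v", backspace_pattern)` should be false, while the escaped and raw variants should both match.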

If you can be sure that ...

  • the create ... statements do not occur in string literals, and

  • you do not need to distinguish between create ... statements followed by a go or not (eg. because all statements are trailed by a go)

...it might even be easier to just replace these strings:

std::basic_regex<TCHAR> reProc(_T("\\bcreate[ \t]+(view|procedure|proc)")); // "\\b": word boundary for the regex engine, not a backspace
std::basic_string<TCHAR> replace = _T("ALTER $1");
return std::regex_replace(strInput, reProc, replace);

(Here is a demo for the latter approach; it reduces the step count to a little more than a quarter.)
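For reference, a self-contained sketch of this simpler replacement, under a few assumptions that are mine rather than part of the original script: narrow `std::string` instead of `TCHAR`, case-insensitive matching, and a trailing `\b` so that only the whole keywords match. The function name `create_to_alter` is hypothetical.

```cpp
#include <regex>
#include <string>

// Replaces "create view|procedure|proc" with "ALTER <keyword>" everywhere in `sql`.
// The keyword keeps its original spelling; the whitespace between the two
// words is collapsed to a single space by the replacement format.
std::string create_to_alter(const std::string& sql) {
    static const std::regex re(
        R"(\bcreate[ \t]+(view|procedure|proc)\b)",
        std::regex::ECMAScript | std::regex::icase | std::regex::optimize);
    return std::regex_replace(sql, re, "ALTER $1");
}
```

Because the pattern has no nested repetition and never spans lines, each match attempt only looks at a handful of characters. Other `create` statements (for example `create table`) are left untouched.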

collapsar
  • The "\b" at the beginning seems to be preventing the STL regex from matching a `create` at the beginning of a line for some reason. Need a double \\ I assume. – BlueMonkMN May 18 '16 at 17:50
  • It seems this solution is still going to be far too slow. If I were writing C# code I would split it up by "\ngo\n" and replace in each component. But I don't know how to do that in STL. It's been running more than a minute and still not done, and I think VB6 was able to do this in less than a minute by processing it one line at a time (I'm rewriting some old code). I thought I could simplify the code by processing it all at once, but the cost turns out to be too high. I don't even know how to split up text into lines with STL. – BlueMonkMN May 18 '16 at 17:55
  • So maybe splitting as suggested in [this SO answer](http://stackoverflow.com/a/13172514) would help? The text portions to be replaced do not appear to span lines. – collapsar May 18 '16 at 18:04
  • The demo from the abovementioned answer adjusted to your sample input is online [here](http://ideone.com/GsKGyJ) – collapsar May 18 '16 at 18:13
It turns out that STL regular expressions badly under-perform compared with Perl (about 100 times slower, if https://stackoverflow.com/a/37016671/78162 is to be believed), so it is apparently necessary to minimize the use of regular expressions in STL/C++ when performance is a serious concern. (The size of the gap blew my mind, since I presume C++ to generally be one of the more performant languages.) I ended up reading the file stream one line at a time and running the expression only on lines that start with "create", like this:

   std::basic_string<TCHAR> result;
   std::basic_string<TCHAR> line;
   std::basic_regex<TCHAR> reProc(_T("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$"), std::regex::optimize);
   std::basic_string<TCHAR> replace = _T("ALTER $1 $2");

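   // Loop on getline's result: checking eof() afterwards appends a spurious
   // empty line when the input ends with a newline.
   while (std::getline(input, line)) {
      // size_type (not int) so the comparison with npos is exact
      std::basic_string<TCHAR>::size_type pos = line.find_first_not_of(_T(" \t"));
      if ((pos != std::basic_string<TCHAR>::npos)
          && (_tcsnicmp(line.substr(pos, 6).data(), _T("create"), 6)==0))
         result.append(std::regex_replace(line, reProc, replace));
      else
         result.append(line);
      result.append(_T("\n"));
   }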
   return result;
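
For anyone not building with `TCHAR`, the same line-by-line idea can be sketched with only the standard library. This is an illustration, not the code I actually ran: it assumes narrow `std::string` input, adds `std::regex::icase` so the regex agrees with the case-insensitive prefix check, and drops the redundant `+` after the keyword group. The names `starts_with_icase` and `create_to_alter_by_line` are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: does `line` contain `word` at `pos`, ignoring case?
bool starts_with_icase(const std::string& line, std::string::size_type pos,
                       const std::string& word) {
    if (pos == std::string::npos || line.size() - pos < word.size()) return false;
    return std::equal(word.begin(), word.end(), line.begin() + pos,
                      [](char a, char b) {
                          return std::tolower(static_cast<unsigned char>(a)) ==
                                 std::tolower(static_cast<unsigned char>(b));
                      });
}

// Reads `input` line by line and runs the regex only on lines whose first
// non-blank token starts with "create"; every other line is copied as-is.
std::string create_to_alter_by_line(std::istream& input) {
    static const std::regex re(
        "^[ \t]*create[ \t]+(view|procedure|proc)[ \t]+(.+)$",
        std::regex::ECMAScript | std::regex::icase | std::regex::optimize);
    std::string result, line;
    while (std::getline(input, line)) {
        const std::string::size_type pos = line.find_first_not_of(" \t");
        if (starts_with_icase(line, pos, "create"))
            result += std::regex_replace(line, re, "ALTER $1 $2");
        else
            result += line;
        result += '\n';
    }
    return result;
}
```

It can be fed any `std::istream`, for example a `std::ifstream` opened on the script or a `std::istringstream` for testing. As in the code above, leading indentation on a rewritten line is dropped, and `create` lines that are not views or procedures pass through unchanged.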
BlueMonkMN