@polygenelubricants' answer to this question includes a C# regex that splits a PascalCase string into separate words:

Regex r = new Regex(
   @"  (?<=[A-Z])(?=[A-Z][a-z])    # UC before me, UC lc after me
    |  (?<=[^A-Z])(?=[A-Z])        # Not UC before me, UC after me
    |  (?<=[A-Za-z])(?=[^A-Za-z])  # Letter before me, non letter after me
    ",
   RegexOptions.IgnorePatternWhitespace
);

I would like to use the same regular expression in C++. However, the C++ standard library's regular expression syntax (std::regex) does not permit lookbehinds of the form (?<=...). Is it possible to make this work anyway?

EDIT: This is clearly not a duplicate. I know C++ doesn't support lookbehinds; I'm asking whether the same functionality can be implemented WITHOUT THEM. For reference, here's how to do it with Boost.Regex, which does support lookbehinds and which I would ideally like to avoid using:

#include <iostream>
#include <string>
#include <vector>

#include <boost/algorithm/string/regex.hpp>
#include <boost/regex.hpp>

int main()
{

  boost::regex r(
    "(?<=[A-Z])(?=[A-Z][a-z])"
    "|(?<=[^A-Z])(?=[A-Z])"
    "|(?<=[A-Za-z])(?=[^A-Za-z])"
  );

  std::vector<std::string> input {
    "AutomaticTrackingSystem",
    "XMLEditor",
    "AnXMLAndXSLT2.0Tool"
  };

  for (auto const &str : input) {
    std::vector<std::string> str_split;

    boost::algorithm::split_regex(str_split, str, r);

    for (auto const &str_ : str_split)
      std::cout << str_ << std::endl;
  }
}
  • Does this answer your question? [Using regex lookbehinds in C++11](https://stackoverflow.com/questions/14538687/using-regex-lookbehinds-in-c11) – Patrick Parker Jan 18 '21 at 19:54
  • @PatrickParker: Not really, I know C++ doesn't have lookbehinds, I would like to know if I can somehow implement this particular regex anyways (without using boost). – Peter Jan 18 '21 at 20:00
  • You can change the regex to not use lookbehind: https://ideone.com/8l8Er3 . That Regex isn't really capturing the whole words, it is capturing the beginning of the word. With a non-lookback version (`[A-Z](?=[A-Z][a-z])|[^A-Z](?=[A-Z])|[A-Za-z](?=[^A-Za-z])`) you can easily look for the ending of the previous word, and shift a little. – xanatos Jan 18 '21 at 21:05
  • @Peter thanks for adding that clarification. It's re-opened – Patrick Parker Jan 18 '21 at 21:53
  • Related: https://stackoverflow.com/questions/43503110/what-is-an-alternative-for-lookbehind-with-c-regex – Jerry Jeremiah Jan 19 '21 at 01:10

1 Answer

You can change the regex to not use lookbehind: [A-Z](?=[A-Z][a-z])|[^A-Z](?=[A-Z])|[A-Za-z](?=[^A-Za-z]).

The original regex matches the zero-width position at the start of each new word, so it has to look behind to check the end of the previous word. Instead, we can match the last character of a word and look ahead for the start of the next word. Each match then sits one character before the split point, so we only have to "move" the position by +1.

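#include <cstddef>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main()
{
    // Note: an empty input produces a single empty piece,
    // so handle "" as a special case if that matters.
    std::string str = "ThisIsAPascalStringX";

    // Each alternative consumes the last character of a word;
    // the lookahead checks the start of the next word.
    std::regex rx("[A-Z](?=[A-Z][a-z])|[^A-Z](?=[A-Z])|[A-Za-z](?=[^A-Za-z])");

    std::vector<std::string> pieces;

    std::size_t lastStartPosition = 0;

    // A default-constructed iterator marks the end of the matches.
    const std::sregex_iterator End;

    for (auto i(std::sregex_iterator(str.begin(), str.end(), rx)); i != End; ++i)
    {
        // The match is the last character of a word,
        // so the next word starts one position later.
        std::size_t startPosition = i->position() + 1;

        pieces.push_back(str.substr(lastStartPosition, startPosition - lastStartPosition));
        lastStartPosition = startPosition;
    }

    // Whatever remains after the last match is the final word.
    pieces.push_back(str.substr(lastStartPosition));

    std::cout << "<-- start" << std::endl;

    for (auto& s : pieces)
    {
        std::cout << s << std::endl;
    }

    std::cout << "<-- end" << std::endl;
}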
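
If you want to reuse this on several inputs, as in the question, the same loop can be wrapped in a function. The following is only a sketch under a few assumptions: the name split_pascal_case is hypothetical (it is not from any library), and returning an empty vector for an empty string is one possible way to handle the special case mentioned above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a PascalCase string using the
// lookbehind-free regex and the "+1" shift described above.
std::vector<std::string> split_pascal_case(const std::string& str)
{
    std::vector<std::string> pieces;

    // Assumption: an empty input yields no pieces at all.
    if (str.empty())
        return pieces;

    // Each alternative consumes the last character of a word and
    // looks ahead to the first character(s) of the next word.
    static const std::regex rx(
        "[A-Z](?=[A-Z][a-z])"
        "|[^A-Z](?=[A-Z])"
        "|[A-Za-z](?=[^A-Za-z])");

    std::size_t last = 0;
    const std::sregex_iterator end;

    for (std::sregex_iterator it(str.begin(), str.end(), rx); it != end; ++it)
    {
        // The split point is one past the matched character.
        const std::size_t start = static_cast<std::size_t>(it->position()) + 1;
        pieces.push_back(str.substr(last, start - last));
        last = start;
    }

    // The text after the last split point is the final word.
    pieces.push_back(str.substr(last));
    return pieces;
}
```

Calling split_pascal_case on the three strings from the question should give the same pieces as the Boost version with lookbehinds, since every match is a single character and the lookahead always requires at least one more character after it.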