
Context. I'm developing a lexer/tokenizing engine that uses regex as a backend. The lexer accepts rules that define the token types/IDs, e.g.

<identifier> = "\\b\\w+\\b".

As I envision it, to do regex-based tokenizing, each rule's regex is enclosed in its own capturing group, and all the groups are joined with ORs into one combined regex.

When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
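To make the setup concrete, here is a minimal sketch of how the combined regex could be built. The `Rule` struct and the `combineRules`/`compileRules` names are made up for illustration, and the sketch assumes that no rule contains capturing groups of its own (non-capturing `(?:...)` groups are fine), so that rule `i` ends up as capture group `i + 1`.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule type: a token name plus the regex that recognizes it.
struct Rule {
    std::string name;
    std::string pattern;
};

// Wrap every rule in its own capturing group and join the groups with '|'.
// Assumption: rule patterns contain no capturing groups of their own,
// otherwise group numbers and rule indexes would no longer line up.
std::string combineRules(const std::vector<Rule>& rules) {
    std::string combined;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (i != 0) {
            combined += '|';
        }
        combined += '(';
        combined += rules[i].pattern;
        combined += ')';
    }
    return combined;
}

// Compile the combined pattern with the default (ECMAScript) grammar.
std::regex compileRules(const std::vector<Rule>& rules) {
    return std::regex(combineRules(rules));
}
```

With the two rules `\b[a-zA-Z]+\b` and `\b\d+\b`, `combineRules` yields `(\b[a-zA-Z]+\b)|(\b\d+\b)`.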

So the question arises: how do I get the ID of the group that matched?

Similar question here, but it does not provide the solution to my specific problem.

Exactly my problem here, but it's in JS, and I need a C/C++ solution.

So let's say I've got a regex, made up of capturing groups separated by an OR:

(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)

which matches whole numbers or alphabetic words.

My problem requires knowing the index of the capture group that each match came from, e.g. when matching the string

foo bar 123

there will be 3 iterations. The group indexes of the matches would be 0 0 1, because the first two matches come from the first capturing group and the last match comes from the second capturing group.

As far as I know, this is not entirely possible with the standard std::regex library (regex_token_iterator is not a solution, because I don't need to skip any matches).

I don't know much about the boost::regex or PCRE libraries.

What is the best way to accomplish this task? Which is the library and method to use?

hakeris1010
  • Loop over all the matches until you find a non-empty match, that's the one that matched. – Barmar Jan 12 '18 at 22:12
  • @Barmar Not really. The `std::regex` result would just be an array of non-empty submatches. E.g. when the 6th group was the only one matched, the `std::match_results` result array will contain 2 entries: the whole regex match, and a submatch of the 6th group, which would be at index 1 of the array, because it's the first one that matched. We can't get the group's index `6` in the regex from that. – hakeris1010 Jan 12 '18 at 22:25
  • Are you sure about that? In every other language, groups are numbered based on the RE, not whether they matched. You need to be able to refer to a specific match without worrying about whether previous groups matched. This seems to make capture groups impossible to use reliably in C++. – Barmar Jan 12 '18 at 22:29
  • As an alternative, run these regexps separately. – Wiktor Stribiżew Jan 12 '18 at 22:31
  • `which would be at index 1 of the array` - This is never true !! The match object is _pre-allocated_, before matching, based on the number of capture groups defined. The actual _sub_match_ object accessed when you de-reference the match_object via index `match[group number]` which returns a pointer to a _sub_match_ object that exists in some other array. If, when dereferencing the match_object it contains no pointer, it is _NULL_. So, what you do is iterate over the number of groups, dereferencing the match_object. If it's not NULL, that's the one that matched. Make sure _no-subs_ is not set. –  Jan 13 '18 at 00:02
  • Just an fyi, the match_object _never_ stores string data from the match. Instead it creates a list of objects, each containing a paired pointer (among other things) into the source string. This paired structure is called the _sub_match_ object. And a _pointer_ to that structure is stored in a list within the pre-sized array. Dereferencing the match_object returns the pointer in that array position. –  Jan 13 '18 at 00:19
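The point about the match object being sized by the number of groups in the regex can be checked with a small sketch. `groupParticipation` is a hypothetical helper name; it relies only on `std::regex_search`, `std::match_results::size()` and `std::sub_match::matched`.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for the first match of `re` in `text`, report for each
// capture group 1..N whether it participated in the match.
// Returns an empty vector when there is no match at all.
std::vector<bool> groupParticipation(const std::regex& re, const std::string& text) {
    std::vector<bool> flags;
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        // After a successful match, m.size() is mark_count() + 1,
        // independent of which groups actually matched.
        for (std::size_t k = 1; k < m.size(); ++k) {
            flags.push_back(m[k].matched);
        }
    }
    return flags;
}
```

For the pattern `(a)|(b)|(c)` and the input `c`, this returns three flags, with only the last one set, so the third group keeps index 3 in the match results even though the first two groups did not match.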

1 Answer


You can use std::sregex_iterator to get all matches. For each match, inspect the std::match_results object and find the group that participated in the match (only one group will match here, either the first or the second), which is conveniently checked with m[index].matched. Subtracting 1 from that group's index gives the zero-based ID:

#include <cstddef>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
    std::string s = "foo bar 123";
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
                              i != std::sregex_iterator();
                              ++i)
    {
        std::smatch m = *i;
        std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';

        // Group 0 is the whole match, so start at 1; size_t avoids a signed/unsigned comparison.
        for (std::size_t index = 1; index < m.size(); ++index) {
            if (m[index].matched) {
                std::cout << "Capture group ID: " << index - 1 << std::endl;
                break;
            }
        }
    }
}

See the C++ demo. Output:

Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1

Note that R"(...)" is a raw string literal, so there is no need to double the backslashes inside it.

Also, index starts at 1 in the for loop because group 0 is the whole match. Since you want zero-based group IDs, 1 is subtracted when the ID is printed.
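If the IDs are needed as data rather than printed, the same loop can be wrapped in a function. This is only a sketch: `matchedGroupId` and `groupIds` are hypothetical names, and returning -1 when no group participated is a convention chosen here.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: zero-based ID of the first capture group that
// participated in the match, or -1 if none did.
int matchedGroupId(const std::smatch& m) {
    for (std::size_t index = 1; index < m.size(); ++index) {
        if (m[index].matched) {
            return static_cast<int>(index) - 1;
        }
    }
    return -1;
}

// Collect the group ID of every match of `r` in `s`, in order of occurrence.
std::vector<int> groupIds(const std::string& s, const std::regex& r) {
    std::vector<int> ids;
    for (std::sregex_iterator it(s.begin(), s.end(), r), end; it != end; ++it) {
        ids.push_back(matchedGroupId(*it));
    }
    return ids;
}
```

A lexer could then use each returned ID as an index into its table of token types.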

Wiktor Stribiżew
  • It works as expected. It turned out that my problem was not with the system, but with the regex I used! In my original test code (which I haven't published here), I used this regex: `(\b\w+\b)|(\b\d+\b)`, and it matched only the first group, because `\w` matches digits too! This simple mistake led me to doubt the system and publish a question! – hakeris1010 Jan 13 '18 at 14:42
  • I think it may be better to use `m[index].matched` instead of `!m[index].str().empty()`. The `matched` field ["indicates if this match was successful"](https://en.cppreference.com/w/cpp/regex/sub_match) – LVK Dec 15 '21 at 13:52
  • @LVK Great tip, this is much better as in a general case, a matched subgroup can actually hold an empty string. – Wiktor Stribiżew Dec 15 '21 at 14:03
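The difference between the two checks shows up as soon as a group can match an empty string. The sketch below uses two hypothetical helpers and a made-up pattern, `(x*)abc|(\d+)`, where the first group is allowed to be empty.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Zero-based ID of the first group whose `matched` flag is set, or -1.
int firstGroupByFlag(const std::smatch& m) {
    for (std::size_t index = 1; index < m.size(); ++index) {
        if (m[index].matched) {
            return static_cast<int>(index) - 1;
        }
    }
    return -1;
}

// Zero-based ID of the first group with non-empty text, or -1.
// This misses groups that participated in the match but captured "".
int firstGroupByNonEmptyText(const std::smatch& m) {
    for (std::size_t index = 1; index < m.size(); ++index) {
        if (!m[index].str().empty()) {
            return static_cast<int>(index) - 1;
        }
    }
    return -1;
}
```

Searching `abc` with that pattern matches through the first alternative, where `(x*)` captures an empty string. `firstGroupByFlag` reports group 0, while `firstGroupByNonEmptyText` finds no group at all.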