How to get all possible matches of std::regex

Question

I would like to find all possible matches of regex, how is it possible?

regex rx("(2|25)");
string s = "2225";
for (sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
    cout << it->position() << ": " << it->str() << endl;
}

Gives output:

0: 2
1: 2
2: 25

But can't find third 2: 2 exactly. I prefer to use regex because of O(n) complexity for searching several tokens at same time.

UPDATE:

Maybe split token list to non-prefixable lists and create several regexes? For example: (2|4|25|45|251|455|267) => (2|4), (25|45|267), (251|455) This will grow complexity to something like O(n log(m))

UPDATE 2:

Please, provide short STL-based algorithm of splitting token vector to non-prefixable vectors to answer this question.

If you only want matches on the `2` why do you use `|25` as part of your regex? — Phylogenesis, Oct 15 '15 at 07:35
@Phylogenesis I wanna discover all 4 matches for `O(n)` complexity :) — k06a, Oct 15 '15 at 07:37
I believe you cannot match the same character in two different matching groups (ie. you won't be able to match the `2` as part of `25` but also on its own). — Phylogenesis, Oct 15 '15 at 07:44
I believe this problems can only be done with two regular expressions. I tested some implementations other than std::regex, and they matches three "2" and no "25". It seems that no regex implementations give up what has been matched and try to find other way to match it. — yzn-pku, Oct 15 '15 at 07:48
@Phylogenesis I hope it is possible without creating `m` regexes for each token. Maybe split token list to non-prefixable lists and create several regexes? — k06a, Oct 15 '15 at 07:49
@k06a In which case, why use regex? This turns into a simple text search problem. — Phylogenesis, Oct 15 '15 at 07:51
@Phylogenesis regex can give me `O(n)` complexity for searching several tokens simultaneously. — k06a, Oct 15 '15 at 08:02
Regex will not just reduce algorithmic complexity. You are having to check whether each of `n` characters are the start of `m` different strings. You are going to struggle to find an algorithm that works better than `O(nm)` in general by the nature of your search space (the best you can really hope for is to do some pre-calculation to find matching prefixes). — Phylogenesis, Oct 15 '15 at 08:08
@Atomic_alarm I thought std regex is a little bit changed boost regex, is not it? — k06a, Oct 15 '15 at 08:08
@k06a, unfortunately no. it is an opportunity that you need is not supported. Similar question: http://stackoverflow.com/questions/14538687/using-regex-lookbehinds-in-c11 — , Oct 15 '15 at 08:15
@k06a From your edit, I'm not sure you really understand algorithmic complexity. You can reduce the number of passes you need to perform, but each pass is doing more work. Searching for `(2|4)` will look for both 2s and 4s in a single pass, but will require twice the amount of work to do so. It doesn't actually reduce the complexity. — Phylogenesis, Oct 15 '15 at 08:19
@Phylogenesis I hope std regex uses DFA. So it should be O(n) with any complex expression. — k06a, Oct 15 '15 at 08:21
A DFA does not reduce the complexity of testing a match to `O(1)`. It depends the number of strings you're matching, but also on the similarity of the strings you're matching (which you're proposing to actually reduce by pulling strings with matching prefixes into separate passes). Doing what you're doing is still pretty close to `O(nm)` complexity. — Phylogenesis, Oct 15 '15 at 08:35
@Phylogenesis DFA is automata with single state at any moment. So transition from state to state is just `O(1)` for every input char. Isn't it? — k06a, Oct 15 '15 at 08:39
If you only want to match a single pattern. You need multiple state transitions if you want to match multiple patterns. — Phylogenesis, Oct 15 '15 at 08:44
@Phylogenesis thats why regex libraries constructs NFA from expression and then transforms it to DFA to achieve real `O(n)` complexity. — k06a, Oct 15 '15 at 08:58
I know it's been a while, but the new answer might help you out for the _regex_ part of your problem. Regarding algorithmic complexity, not much to note, but it does gets progressively quicker to search through the string after each match, as the string to search through reduces in length after removing previously matched text. Keep in mind the regex engine moves from left-to-right. — kayleeFrye_onDeck, Jun 27 '17 at 19:53
Possible duplicate of [Given a regular expression, how would I generate all strings that match it?](https://stackoverflow.com/questions/20080789/given-a-regular-expression-how-would-i-generate-all-strings-that-match-it) — Anderson Green, Feb 10 '18 at 20:38

score 2 · Answer 1 · answered Oct 15 '15 at 07:48

2

I dont think it's possible with an iterator and a single regexp. Here's how it works.

Your regexp searches for a substring that is either "2" or "25". Now, you start the search with sregex_iterator. It starts with the first symbol of the string, and tries to find match with your regular expression. If there is a match, it is "recorded", and the iterator is advanced to the position after the match. If there is no match, the iterator is advanced 1 position forward. This process continues until the end of the string is reached.

Now, each time it finds a match it will try to find the best (i.e., longest) match from your regular expression. So if a substring matches both 2 and 25, it will take 25 since it's longer. So I'd say you need 2 regular expressions.

answered Oct 15 '15 at 07:48

SingerOfTheFall

29,228
8
68
105

Help me to avoid `m` regular expression, each one will contain only 1 token :) Complexity grows to `O(nm)` :( – k06a Oct 15 '15 at 07:50
1

I don't think it actually matches the longest match, it matches the *first* match. The order of the alternates matters: see this [example](http://ideone.com/Jgq9pZ). – Phylogenesis Oct 15 '15 at 07:55
@Phylogenesis, that's fascinating because on my machine the result is exactly the opposite (`rx1` finds `2 2 25` and `rx2` finds `2 2 2`) – SingerOfTheFall Oct 15 '15 at 07:56
1

`std::(basic_)regex` should default to using ECMAScript rules if I'm not mistaken, which should prefer the first matching alternative. If you use extended / POSIX it will prefer the longer one IIRC. Try this: http://coliru.stacked-crooked.com/a/79388d279f74ce52 – melak47 Oct 15 '15 at 07:56
1

@k06a, you can group them if the subexpressions are not prefixes of any other subexpressions, e.g. you can group `3|45` because `3` is not a prefix of `45` – SingerOfTheFall Oct 15 '15 at 07:58
@melak47 looks like ECMA works different :) cpp.sh/7qni – k06a Oct 15 '15 at 08:01
This seems to [have changed since gcc 5](http://coliru.stacked-crooked.com/a/2f29ba636b098818) – melak47 Oct 15 '15 at 08:04
If I change the OPs original regexp to `regex rx("(25|2)");`, it finds `2,2,2`. I fail to understand the logic here honestly – SingerOfTheFall Oct 15 '15 at 08:04
@SingerOfTheFall looks like order should not me important at all :) But it is terrible if order impacts. – k06a Oct 15 '15 at 08:06

score 1 · Answer 2 · answered Oct 15 '15 at 08:02

1

You can't obtain the third '2', because regexes always return the longest match. In order to get "all the possible matches" you need to run the query two times, since 2 is contained in 25.

answered Oct 15 '15 at 08:02

deight

331
1
4
15

How to get all possible matches of std::regex

2 Answers2

Linked