-1

Given a regular expression re and an input string str, I want to find the maximal substring of str, which starts at the minimal position, which matches re.

Special case:

re = Regex("a+|[ax](bc)*"); str = "yyabcbcb"

matching re with str should return the matching string "abcbc" (and not "a", as PCRE does). I also have in mind, that the result is as I want, if the order of the alternations is changed.

Cosinus
  • 31
  • 4
  • Then use `"[ax](?:bc)*|a+"`. Once `a` is matched with `a+`, you cannot expect the next alternative to be tried. – Wiktor Stribiżew Feb 04 '18 at 18:56
  • But that is exactly, what I want. I need a kind of "greedy alternation". Also it is not an option to adapt the given regex to the input string. For example with str = "yyaaaa" the ouput shall be "aaaa" and not "a". – Cosinus Feb 04 '18 at 19:59
  • PCRE is an NFA regex engine, where the first alternation pattern that matches makes the regex engine skip over the rest of alternatives. There is no way to make it behave differently. Use a DFA regex library as in `sed` / `grep` and other POSIX based regex engines. – Wiktor Stribiżew Feb 04 '18 at 20:07
  • Can you recommend a specific regex library, which can serve my purpose? Ideally it should be C-code available for Linux, Windows, Mac-OS. – Cosinus Feb 04 '18 at 21:06

2 Answers2

0

The options I found were:

  1. POSIX extended RE - probably outdated, used by egrep ...
  2. RE2 by Google - open source RE2 - C++ - also C-wrapper available
Cosinus
  • 31
  • 4
0

From my point of view, there are two problems with your question.

First is that changing the order of alternations the results are supposed to change.

For each single 'a' in the string, it can either match 'a+' or "ax*". So it is ambiguous for matching 'a' to alternations in your regular expression.

Second, for finding the maximal substring, it requires the matching pattern of the longest match. As far as I know, only RE2 has provided such a feature, as mentioned by @Cosinus.

So my recommendation is that separating "a+|ax*" into two regexes, finding the maximal substring in each of them, and then comparing the positions of both substrings.

As to find the longest match, you can also refer to a previous regex post description here. The main idea is to search for substrings starting from string position 0 to len(str) and to keep track of the length and position when matched substrings are found.

P.S. Some languages provide regex functions similar to "findall()". Be careful of using them since the returns may be non-overlapping matches. And non-overlapping matches do not necessarily contain the longest matching substring.

Peipei
  • 136
  • 1
  • 9