I'm attempting to implement a text clustering algorithm. The algorithm clusters similar lines of raw text by replacing them with regexes, and aggregates the number of patterns matching each regex so as to provide a neat summary of the input text instead of showing repetitive patterns from the input text. In this attempt I ran into the need to find if one regex covers another.
Assume we are concerned only about regular expressions with '*' and '+' wild cards i.e. '*' meaning zero or more occurences of an alphabet, and '+' meaning 1 or more occurences of an alphabet. Also assume the character set to be ASCII.
For example:
1. AB covers AB
This is straightforward.
2. ABC* covers ABC
Because ABC* can generate: ABC, ABCC, ABCCC etc.
3. A*B+C* covers AB+C*
Because A*B+C* can generate ABBC, AABBC, AABBCC etc. which covers
all strings generated by AB+C*.
4. A+M+BC* covers AMM+B+C+M+BC*
Similar to case [3] above.
Basically I'm looking for an efficient implementation of the following method which tells if strA (may contain a regex) covers for strB (may contain a regex). Note that there should also be a way to escape the regex characters '*' and '+' in the input strings strA and strB.
Method signature in C++:
bool isParentRegex(const string& strA, const string& strB)
My thinking is that a recursive approach is required for the implementation and it may be a bit complicated. But I'm curious to know if I could reuse existing implementations instead of re-inventing the wheel or if there are any other straightforward ways to do it.