2

Suppose I have a list of very simple regex represented as strings (by "very simple", I mean only containing .*). Every string in the list starts and ends with .*. For example, I could have

rs = [.*a.*, .*ab.*, .*ba.*cd.*, ...]

What I would like to do is keep track of those patterns that are a subset of another. In this example, .*a.* matches everything .*ab.* does, and more. Hence, I consider the latter pattern to be redundant.

What I thought to do was to split the strings on .*, match up corresponding elements, and test if one startswith the other. More specifically, consider .*a.* and .*ab.*. Splitting these on .*

a = ['', 'a', '']
b = ['', 'ab', '']

and zipping them together gives

c = [('', ''), ('a', 'ab'), ('', '')]

And then,

all(elt[1].startswith(elt[0]) for elt in c)

returns True and so I conclude that .*ab.* is indeed redundant if .*a.* is included in the list.

Does this make sense and does it do what I am trying to do? Of course, this approach gets complicated for a number of reasons, and so my next question is, is there a better way to do this that anyone has encountered previously?

user4601931
  • 4,982
  • 5
  • 30
  • 42
  • Look into this. It requires an understanding of the state machines and formal grammars that regular expressions rely upon http://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another – Patrick Haugh Dec 07 '16 at 16:52
  • @PatrickHaugh Thanks for your reply. I've seen this post, and a related post on SO that references [this](https://github.com/ferno/greenery) package, but it is computationally prohibitive to use. I was hoping there was a naive approach that was simple-ish and would work for at least some cases. – user4601931 Dec 07 '16 at 17:08

2 Answers2

1

For this problem you need to find the minimal DFAs for both the regex and compare them.
Here is the link of a discussion of same problem- How to tell if one regular expression matches a subset of another regular expression?

Community
  • 1
  • 1
pralave
  • 36
  • 2
1

Assuming every letter combination is surrounded by .* and does not have it in the middle, the approach would almost work. Instead of startswith you need to check for contains, though.

reglist = ['.*a.*', '.*ab.*', '.*ba.*', '.*cd.*']
patterns = set(x.split('.*')[1] for x in reglist)
remove = []
for x in patterns:
    for y in patterns:
        if x in y and x != y:
            remove.append(y)

print (['.*{}.*'.format(x) for x in sorted(patterns - set(remove))])      

gives you

['.*a.*', '.*cd.*']
576i
  • 7,579
  • 12
  • 55
  • 92