I'm trying to identify the actual sub-expression used by each group within a regular expression string, for example:
([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})
I would like a way to extract groups and get this:
Group 1: ([A-Z])
Group 2: (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})
Group 3: ([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*)
I've tried re.compile; it gives me the number of groups (3), which is fine and is one of the things I wanted to know. What I'm looking for, though, is the actual text of each group.
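For reference, the count mentioned above comes from the groups attribute of a compiled pattern. A minimal sketch, assuming the example regex is stored as a raw string:

```python
import re

# The example regex from above, written as a raw string (an assumption).
pattern = r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})"

compiled = re.compile(pattern)
# .groups is the number of capturing groups; it does not expose their text.
print(compiled.groups)
```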
I see two possible ways to do this:
- Either a built-in method from the 're' or 'sre_parse' libraries (which I've already looked at), or from any other useful library, that exposes this information
- Or an actual regular expression that analyses the regular expression string...
What I really want is to reduce the number of groups to 1 (without otherwise altering the regular expression), so I need to programmatically identify all groups and remove the parentheses around them until only the last one is left (I need just one group in each regular expression).
Now the reason why I need this:
I have a program that works like a parser; it has a long list of regular expressions to try against a string.
So instead of looping over, say, 10 regexes for each log line (until one of them matches), I join all the regexes from the list into a single pattern delimited with "|" and use re.findall. The nice thing is that it returns a list of everything matched by the concatenated regexes, so this list effectively represents the matching groups of the combined pattern. With a single group in each regex from the list, and given that ONLY one regular expression from the list will match a given line, the position of the group that matched is the index I use to fetch the full regex from the list and apply it to that line. This eliminates the loop over the whole regex list.
The other approach I've used is to just remove all groups at once, which worked well until I found out that doing so corrupts some specific regexes.
In my example, this regex:
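To make the index lookup concrete, here is a minimal sketch. The three patterns in regex_list and the helper name matching_regex_indexes are hypothetical, chosen only so that each pattern has exactly one capturing group:

```python
import re

# Hypothetical patterns, each with exactly one capturing group.
regex_list = [
    r"(ERROR \d+)",
    r"(user=\w+)",
    r"(\d{4}-\d{2}-\d{2})",
]
combined = re.compile("|".join(regex_list))


def matching_regex_indexes(line):
    """Return the regex_list index of every pattern that matched in the line."""
    hits = []
    # With several groups, findall returns one tuple per match,
    # holding one slot per group; unmatched groups are empty strings.
    for row in combined.findall(line):
        for index, text in enumerate(row):
            if text:
                hits.append(index)
    return hits


print(matching_regex_indexes("login failed for user=bob"))
```

The slot position equals the list index only while every pattern contributes exactly one group, which is why the group count of each regex matters here.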
([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})
will become:
[A-Z] [OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*{1,}\s*C=[A-Z]{2}
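This is an invalid regex: with the parentheses gone, the "{1,}" quantifier directly follows "*" (the "*{1,}" sequence), so it no longer has a group to repeat.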
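The failure can be reproduced with a short check. This is a sketch of the naive removal described above; the helper name is_valid is hypothetical, and the blanket replace would also strip escaped parentheses, which is another reason it is unsafe:

```python
import re

original = r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})"
# Naive approach: drop every parenthesis in the pattern.
stripped = original.replace("(", "").replace(")", "")


def is_valid(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(stripped)
print(is_valid(original), is_valid(stripped))
```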
I can't prevent this, as the regular expressions may change in the future and it would be a pain to search manually for this kind of issue...
So if I isolate each group, I can run a compile to validate the regex with and without that group's parentheses and decide whether to take them out. In my case I would iterate over 3 groups:
Group 1: ([A-Z])
Group 2: (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})
Group 3: ([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*)
Taking out the parentheses from Group 1 --> test the whole regex without this group (OK)
Taking out the parentheses from Group 2 --> test the whole regex without this group (OK)
Taking out the parentheses from Group 3 --> test the whole regex without this group (invalid regex) --> leave it alone.
This approach can also help me identify other regular expressions that cannot be reduced to a single group, so I can take care of those separately.
In fact, the third iteration will not be required, as that will be the last group, and this is what I need: a single group in the whole expression.
Can anyone suggest a good way to achieve this?
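The iteration described above could look roughly like the sketch below. The names plain_group_spans and reduce_to_one_group are hypothetical, and the scanner is deliberately simplified: it only handles plain "(...)" groups, leaves named groups untouched, and does not treat a "]" placed first in a character class as a literal.

```python
import re


def plain_group_spans(pattern):
    """Return (start, end) slices of plain "(...)" groups, ordered by opening parenthesis."""
    spans, stack, in_class, i = [], [], False, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2                      # skip escaped characters
            continue
        if in_class:
            in_class = ch != "]"        # stay in the class until "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            plain = pattern[i + 1:i + 2] != "?"
            if plain:
                spans.append([i, None])
            stack.append(len(spans) - 1 if plain else None)
        elif ch == ")":
            slot = stack.pop()
            if slot is not None:
                spans[slot][1] = i + 1
        i += 1
    return [tuple(span) for span in spans]


def reduce_to_one_group(pattern):
    """Unwrap groups one at a time while the result still compiles; stop at one group."""
    while re.compile(pattern).groups > 1:
        for start, end in plain_group_spans(pattern):
            candidate = pattern[:start] + pattern[start + 1:end - 1] + pattern[end:]
            try:
                re.compile(candidate)
            except re.error:
                continue                # removing this pair breaks the regex
            pattern = candidate
            break
        else:
            break                       # nothing more can be removed safely
    return pattern


example = r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})"
reduced = reduce_to_one_group(example)
print(reduced)
```

One caveat with compile-only validation: a pattern can still compile after the parentheses are removed and yet mean something different, for example "(a|b)c" becoming "a|bc". Replacing "(" with "(?:" instead of deleting the pair would avoid that, at the cost of keeping the parentheses in the text.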
Regards,