
I'm trying to identify the actual regular expression text used in each group within a regular expression string, for example:

([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})

I would like a way to extract the groups and get this:

Group 1: ([A-Z])

Group 2: (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})

Group 3: ([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*)

I've tried re.compile; it gives me the number of groups (3), which is one of the things I wanted to know. But what I'm looking for is the actual text of each group.

I see two ways to do this:

  1. A built-in method from the 're' library or the 'sre_parse' library (which I've already looked at), or any other useful library, to get this info
  2. Writing an actual regular expression to analyse the regular expression string...
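
As a rough sketch of the second option, a small hand-written scanner (rather than a regex) can walk the pattern and record where each capturing group opens and closes. The function name `capture_group_spans` is hypothetical, and the sketch assumes the pattern is already valid, is not written in verbose mode, and that only escapes, character classes, and `(?...)` constructs need special handling:

```python
import re

def capture_group_spans(pattern):
    """Return (start, end) spans of capturing groups, in group-number order."""
    spans = []
    stack = []        # (position of "(", is_capturing) for each open group
    in_class = False  # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2    # skip the escaped character, e.g. \( or \]
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A "]" directly after "[" or "[^" is a literal, not the end.
            if pattern.startswith("^", i + 1):
                i += 1
            if pattern.startswith("]", i + 1):
                i += 1
        elif ch == "(":
            # "(?...)" is non-capturing unless it is a named group "(?P<...".
            capturing = (not pattern.startswith("(?", i)
                         or pattern.startswith("(?P<", i))
            stack.append((i, capturing))
        elif ch == ")":
            start, capturing = stack.pop()
            if capturing:
                spans.append((start, i + 1))
        i += 1
    # Groups are numbered by the position of their opening parenthesis.
    return sorted(spans)

pattern = r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})"
group_texts = [pattern[start:end] for start, end in capture_group_spans(pattern)]
for number, text in enumerate(group_texts, start=1):
    print(f"Group {number}: {text}")
```

Comparing `len(group_texts)` with `re.compile(pattern).groups` gives a cheap sanity check that the scanner and the `re` module agree on the number of groups.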

What I really want is to reduce the number of groups to 1 (without otherwise altering the regular expression), so I need to programmatically identify all groups and "remove" the parentheses around them until only the last one is left (I just need one group in each regular expression).

Now the reason why I need this:

I have a program that works like a parser; it has a vast list of regular expressions to try against a string.

So instead of looping over, let's say, 10 regexes for each log line (until one of them matches), I join all the regexes from the list into a single pattern delimited with "|" and use re.findall. What is great about this is that it returns a list of all matches found across the concatenated regexes, so this list effectively represents the matching groups of the concatenated regex. With a single group in each regex from the list, and given that only ONE regex from the list will match, the number of the matching group is the index I use to grab the full regex from that list and apply it to that line. This eliminates the loop over the whole regex list.
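
A minimal sketch of that idea, assuming a made-up `regex_list` with exactly one capture group per entry (the three patterns and the helper name `matching_regex_index` are illustrative, not the real ones from my program):

```python
import re

# Hypothetical regex list: exactly one capture group per entry.
regex_list = [
    r"(ERROR \d+)",
    r"(WARN \w+)",
    r"(INFO .+)",
]
combined = re.compile("|".join(regex_list))

def matching_regex_index(line):
    """Return the index in regex_list of the regex that matched, or None."""
    for groups in combined.findall(line):
        # With several groups, findall yields one tuple per match;
        # groups that did not take part come back as empty strings.
        for index, text in enumerate(groups):
            if text:
                return index
    return None

print(matching_regex_index("WARN disk"))
```

One caveat of relying on re.findall here: a group that legitimately matches an empty string looks the same as a group that did not participate, so this only works when every group must consume at least one character.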

The other approach I've used is to just "remove" all groups at once, which was working well until I found out that doing this corrupts some specific regexes.

In my example this regex:

([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})

will become:

[A-Z] [OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*{1,}\s*C=[A-Z]{2}
                                    ^^^^^
and this is an invalid regex--------^

I can't prevent this, as the regular expressions may change in the future and it would be a pain to search for this kind of issue manually...

So if I isolate each group, I can run a compile to validate the regex with and without that group's parentheses and decide whether to take them out. In my case I would iterate over 3 groups:

Group 1: ([A-Z])

Group 2: (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})

Group 3: ([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*)

Taking out parentheses from Group 1 --> Test whole regex without this group (OK)

Taking out parentheses from Group 2 --> Test whole regex without this group (OK)

Taking out parentheses from Group 3 --> Test whole regex without this group (invalid regex) --> leave it alone.
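
The three checks above can be sketched as follows, assuming the group texts are already known (here they are simply hard-coded from the list above; `compiles` and `without_parens` are hypothetical helper names):

```python
import re

pattern = r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})"
groups = [
    r"([A-Z])",
    r"(([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})",
    r"([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*)",
]

def compiles(candidate):
    """Return True if re.compile accepts the candidate pattern."""
    try:
        re.compile(candidate)
        return True
    except re.error:
        return False

def without_parens(full_pattern, group_text):
    """Drop the outer parentheses of the first occurrence of group_text."""
    return full_pattern.replace(group_text, group_text[1:-1], 1)

for number, group_text in enumerate(groups, start=1):
    candidate = without_parens(pattern, group_text)
    status = "OK" if compiles(candidate) else "invalid, leave it alone"
    print(f"Group {number}: {status}")
```

Note that a successful compile only shows the result is syntactically valid. Removing parentheses can still change what the regex matches (for example `(ab)+` becomes `ab+`), whereas turning a group into a non-capturing one, `(?:...)`, keeps the behaviour and only drops the capture.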

This approach can also help me identify other regular expressions that can't be reduced to a single group, so I can take care of those.

In fact the third iteration is not required, since it is the last group and that is what I need: a single group in the whole expression.

Can anyone suggest a good way to achieve this?

Regards,

Wiktor Stribiżew
Larry
  • No, that is not what I want; please read my post. What I want is the actual regex INSIDE the group, NOT the result itself. I want a way to get the regex used for each group, using the re library instead of building another regex to pull them out (which I've already tried, and it is not working for me). Please understand: I do not want the "result" of this regex applied to an actual line, I want the regex that is being used, like ([A-Z]) – Larry Feb 17 '21 at 12:58
  • How do I test an actual group (using re.compile) if I do not know how to extract the group from the regular expression (literally the regex text), remove the parentheses, and reapply re.compile to test it? By the way, take into account that this will be done automatically by a program that needs to pull the expression to know which group to test – Larry Feb 17 '21 at 13:01
  • From your referral what I want is this 'ERP|Gap' ----> the expression NOT the result applying it into a string – Larry Feb 17 '21 at 13:04
  • Can you please point me to a link showing how to use the re library to list the regular expression groups (NOT the results; this is even before using re.search, re.match, re.findall, etc.)? As I said, from the compiled expression I know the count of groups used; what I want is the LIST of these groups in regex notation, as I've expressed in my post. – Larry Feb 17 '21 at 13:12
  • Yes, me too; that is why I've posted here, because I've been searching for the last 3 days for a way to do it. – Larry Feb 17 '21 at 13:27
  • What I will do is re-edit my initial post to clarify it better – Larry Feb 17 '21 at 13:28
  • The `re` library doesn't have something to do what you want. **1.** You would need to keep your regex's in a list and try each regex for each line you're checking, and that would tell you which regex actually matched. However, you would need to be "in control" of the regex to write the splits. **2.** Another way would be like in the [Tokenizer](https://docs.python.org/3/library/re.html#writing-a-tokenizer) example, where you create a list of tuples, which are the names and regexes you're matching. Unless you plan to write/use a regex parser, the groups can't be _automatically_ extracted out. – aneroid Feb 17 '21 at 13:42
  • The first approach is what I'm actually using... a log file of 49K lines takes around 3 days to be processed, so I'm looking for ways to speed up; and this section is one of the bottlenecks I'm trying to resolve. – Larry Feb 17 '21 at 14:36
  • Actually, "*what I've done is join all the regex from the list into a single line delimited with "|" and use re.findall*" might turn out not that great, probably the separate regexps are even faster than that monster of a regex. You can only do that reliably if all these patterns match the entire string, or if you are sure they never match at the same location in the string, else, you may end up with catastrophic backtracking. – Wiktor Stribiżew Feb 17 '21 at 14:59
  • It seems impossible that 49k lines would take 3 days to process. Even if each line was 1 KB, that's just 50 MB. This kind of regex taking 3 days to process 50 MB of text seems impossible. It could be _really old_ hardware (20-30 yrs) but I suspect the performance problems come from elsewhere. Could you share 10 lines of your log file, which includes matches for the 3 groups as well as non-matches? And just to **confirm**, you originally have the 10 regexes as separate items in a list? – aneroid Feb 17 '21 at 15:08
  • This is a clear indication my comment is correct, the regexps are not matching the entire string, and some or most of them have the same prefix and can match at the same position(s) inside the string. It is a typical catastrophic backtracking scenario. – Wiktor Stribiżew Feb 17 '21 at 15:11
  • If the regex takes too long it's **very** likely that the problem is catastrophic backtracking. (Python regex engine does have this problem, see for example https://stackoverflow.com/questions/40065108/regex-taking-too-long-in-python ) – user202729 Feb 17 '21 at 15:50
  • So removing the group captures will likely not improve the performance. (besides they're all compiled only once anyway) – user202729 Feb 17 '21 at 15:51
  • There's https://stackoverflow.com/questions/42136040/how-to-combine-multiple-regex-into-single-one-in-python for combining them, but it doesn't really work in this case; you can try replacing the groups with atomic groups (although there might be back references in the pattern?...) – user202729 Feb 17 '21 at 15:54
  • Guys, thank you for your comments. The issue is not the regexes taking time to process (they run in reasonable time); the issue is that the program that works with them is doing a lot of other stuff that makes it slow. I'm working to speed things up, and one bottleneck is this specific process, which takes time to parse because it cycles 30+ times over the same line matching different regexes. That's why I was trying to simplify the process; what I need is what I've posted: reduce the number of groups so I can select the proper regex directly and apply that one... – Larry Feb 17 '21 at 17:11
  • Then please add your code. Also, please consider posting your code at [codereview.se] if it works in general, but you just want to optimize it. – Wiktor Stribiżew Feb 17 '21 at 17:32

1 Answer

OK, after spending a lot of time trying to create a regex parser to properly identify groups for the purpose I wanted, I stopped and rethought my approach...

What I did was scan each regex in the list using re.compile, count its groups, and build an auxiliary list (this lives in the initialization routine of my program, as the info is static) that maps each group to the index of the regular expression containing it. For example, take the following list:

string = [
    "([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})",
    "Received a notarisation request for Tx\\[([A-Z0-9]{64})\\] from \\[(([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})\\]",
    "(Flow \\[([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9-]{12})\\] error allowed to propagate)"
]

Running the new code I got this output:

{0: {'groups': [1, 2, 3]}, 1: {'groups': [4, 5, 6]}, 2: {'groups': [7, 8]}}
[0, 0, 0, 1, 1, 1, 2, 2]

It mimics the group numbering as if all the regexes were part of a single regular expression... This also gives me a validation point where I can manage the number of regexes to be joined, as joining too many can sometimes cause trouble.

Each index of the auxiliary list represents a matching group, and the value at that index is the index, in the regex list, of the regular expression that contains that group (the dictionary was used just for testing and reference).

This means that, for example, group 5 is used by regular expression 1 in the string list; this solves my initial concern. And this routine is now working faster than before...

I've also adjusted the group count in my program, because re.compile group numbers always start from 1 but Python list indices start from 0, so I've taken that into account too.

For anyone interested in how I fixed it, see the code below:
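
To illustrate how the auxiliary list can be used at match time, here is a minimal sketch. The joined `combined` pattern, the helper name `find_regex_index`, and the sample log line are assumptions made for the example; my real program differs. Because `match.groups()` is a 0-based tuple covering groups 1..n, its positions line up directly with the auxiliary list:

```python
import re

regex_list = [
    r"([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})",
    r"Received a notarisation request for Tx\[([A-Z0-9]{64})\] from \[(([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\s*C=[A-Z]{2})\]",
    r"(Flow \[([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9-]{12})\] error allowed to propagate)",
]

# Auxiliary list: position k holds the index of the regex owning group k + 1.
index_grp = []
for index, source in enumerate(regex_list):
    index_grp.extend([index] * re.compile(source).groups)

combined = re.compile("|".join(regex_list))

def find_regex_index(line):
    """Return the index in regex_list of the regex that matched, or None."""
    match = combined.search(line)
    if match is None:
        return None
    # Groups that did not take part in the match are None.
    for position, text in enumerate(match.groups()):
        if text is not None:
            return index_grp[position]
    return None

# Made-up sample line that should hit the third regex (index 2).
line = "Flow [12345678-abcd-abcd-abcd-123456789012] error allowed to propagate"
print(find_regex_index(line))
```

Once the index is known, the original regex from `regex_list` can be applied to the line on its own, so its group numbers are the ones I expect.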

import re
string = [
    "([A-Z]) (([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})",
    "Received a notarisation request for Tx\\[([A-Z0-9]{64})\\] from \\[(([OLUNCST]{1,2}=[a-zA-Z0-9-. ,]*){1,}\\s*C=[A-Z]{2})\\]",
    "(Flow \\[([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9-]{12})\\] error allowed to propagate)"
]

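def group_index_reference(string_list):
    """Map every capture group of the joined regex to the index of its source regex."""
    group_data = {}  # regex index -> global group numbers (testing/reference only)
    index_grp = []   # position k holds the regex index that owns global group k + 1
    group_pos = 0    # number of groups contributed by the regexes seen so far
    for index, each_string in enumerate(string_list):
        no_groups = re.compile(each_string).groups
        group_data[index] = {
            "groups": [grp + group_pos for grp in range(1, no_groups + 1)]
        }
        # One entry per group, all pointing back at this regex.
        index_grp.extend([index] * no_groups)
        group_pos += no_groups

    print(group_data)
    print(index_grp)

    return index_grp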


group_index_reference(string)

Wiktor Stribiżew
Larry
  • Ah, by the way, I forgot to thank everyone, and Wiktor Stribiżew in particular for suggesting a profiling tool. I will test the ones I saw here https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script they look very nice – Larry Feb 18 '21 at 12:16