
I have a string

string  ='((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'

I want to split it into a list based on the outermost parentheses. Following solutions I found, I used

import re
my_data = re.findall(r"(\(.*?\))", string)

but when I print my_data, the output is (len = 4)

['((clearance)', '(embedded)', '(software engineer OR developer)', '(embedded)']

but my desired output is (len = 2)

['(clearance) AND (embedded) AND (software engineer OR developer)', '(embedded)']

because "(clearance) AND (embedded) AND (software engineer OR developer)" is inside one pair of parentheses and "embedded" is inside another. Why does "re.findall" split the string into 4 items?

How should I modify the regular expression to get my desired output?

Raady
  • If it weren't for Python, you could have used [RegEx recursion](https://regex101.com/r/OWzBKh/1/) without a problem, but Python doesn't support that by default (I believe). You could try installing [this](https://pypi.org/project/regex/) and see if you can get the pattern from my first link to work that way, but I can't promise anything (I'm not that familiar with Python). – Tobias Tengler Dec 12 '18 at 15:48
  • Will the depth of `()` always be two, or can it be higher? – Code Maniac Dec 12 '18 at 15:51
  • It's random; it can be multiple levels. – Raady Dec 12 '18 at 15:53
  • You want to match balanced parentheses. – Code Maniac Dec 12 '18 at 15:54
  • With an unknown number of nested parentheses, this can *not* be done with pure regex. Refer to this related answer: https://stackoverflow.com/a/1732454/8408080 – user8408080 Dec 12 '18 at 15:56
  • @user8408080 It can be done easily if balanced parentheses are not necessary. – Code Maniac Dec 12 '18 at 15:57
  • Is there any other way to detect the patterns, or should I write the program manually to count open and closed parentheses? – Raady Dec 12 '18 at 15:58
  • I think writing your own parser for this is not hard; you just need one counter. Remember the index where the counter goes from 0 to 1 and the index where the counter goes from 1 to 0. There you have your positions. – user8408080 Dec 12 '18 at 16:00

2 Answers


With Python's built-in `re` module, this is not possible for an arbitrary nesting depth, so here is an idea that counts parentheses:

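def find_stuff(string):
    # Collects [start, end, start, end, ...] positions of the outermost groups.
    indices = []
    counter = 0  # current nesting depth
    change = {"(": 1, ")": -1}
    for i, el in enumerate(string):
        # Any character other than a parenthesis leaves the depth unchanged.
        new_count = counter + change.get(el, 0)
        if counter == 0 and new_count == 1:
            # An outermost group opens here.
            indices.append(i)
        elif counter == 1 and new_count == 0:
            # An outermost group closes here; store the exclusive end index.
            indices.append(i + 1)
        counter = new_count
    return indices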

This is not very beautiful, but I think the concept is clear. It returns the start and end indices of the outermost parentheses, so you can just slice your string with these.
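
As a rough sketch of the slicing step, the returned indices can be paired up as (start, end) and used to cut the string. The helper name `outer_groups` is made up for this illustration and is not part of the answer; the sketch also assumes the parentheses in the input are balanced.

```python
def find_stuff(string):
    # Same counting idea as above: record where the depth goes 0 -> 1 and 1 -> 0.
    indices = []
    counter = 0
    change = {"(": 1, ")": -1}
    for i, el in enumerate(string):
        new_count = counter + change.get(el, 0)
        if counter == 0 and new_count == 1:
            indices.append(i)
        elif counter == 1 and new_count == 0:
            indices.append(i + 1)
        counter = new_count
    return indices


def outer_groups(string):
    # Hypothetical helper: pair even positions (starts) with odd positions (ends).
    indices = find_stuff(string)
    return [string[start:end] for start, end in zip(indices[::2], indices[1::2])]


string = '((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'
groups = outer_groups(string)
print(groups)
# ['((clearance) AND (embedded) AND (software engineer OR developer))', '(embedded)']
```

Note that each slice keeps its own outer parentheses. The desired output in the question drops them only for the first group, so removing them (for example with `group[1:-1]`) would be a separate step that depends on what exactly is wanted.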

user8408080
  • Thanks a lot, it worked. Though I really didn't understand what "change" is doing. – Raady Dec 13 '18 at 09:56
  • I used the dict as a quick way to count parentheses. I always add `change.get(el, 0)` to the `counter`. If `el` is `(`, this adds `1`, meaning that there is one more unclosed bracket. If `el` is `)`, it adds `-1`, effectively saying that there is a closing bracket, so I'm one level down. The `get` method takes a default argument, so if `el` is neither `(` nor `)`, the current "deepness level" is changed by `0`, so not at all. – user8408080 Dec 13 '18 at 10:05
  • Great, got it now. Thanks! – Raady Dec 13 '18 at 10:07
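
To make the role of the `change` dict concrete, here is a small sketch that records the nesting depth after each character. The function name `depth_trace` is invented for this example; only the dict and the `get` default come from the answer.

```python
change = {"(": 1, ")": -1}


def depth_trace(text):
    # Hypothetical helper: nesting depth after each character of text.
    depth = 0
    trace = []
    for ch in text:
        # get() falls back to 0 for anything that is not a parenthesis.
        depth += change.get(ch, 0)
        trace.append(depth)
    return trace


print(depth_trace("((a) b)"))
# [1, 2, 2, 1, 1, 1, 0]
```

The outermost group starts where the depth first becomes 1 and ends where it returns to 0, which is exactly what the answer's two `if` branches check.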

A bit of an re hack, but this is possible:

>>> import re
>>> string = '((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'
>>> [e for e in re.split(r'\((?=\()(.*?)(?<=\))\)|(?<!\()(\([^()]+\))(?!\))',string) if e and '(' in e and ')' in e]
['(clearance) AND (embedded) AND (software engineer OR developer)', '(embedded)']
dawg