4

Given the string

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

I'd like to extract everything within each pair of parentheses, unless the parentheses are inside a quoted string. So far I can get everything within parentheses, but I can't figure out how to stop the match from ending at a parenthesis inside the quotes. My current code is:

import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

p = re.compile(r"\((.*?)\)")
m = p.findall(S)
for element in m:
    print(element)

What I want is:

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

What I currently get is:

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation
45795362,-1,'!!_(disambiguation

What can I do in order to ignore the internal paren?

Thank you!!


In case it helps, here are the threads I've looked at:

1) REGEX-String and escaped quote

2) Regular expression to return text between parenthesis

3) Get the string within brackets in Python

MayaR

4 Answers

3

You can use a non-capturing group to assert either a comma or the end of the string follows:

p = re.compile(r'\((.*?)\)(?:,|$)')
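As a minimal sketch, here is that pattern run against the string from the question. The `tricky` input is a hypothetical example, not taken from the question, added to show the assumption this approach relies on: a `)` followed by a comma never occurs inside a quoted field.

```python
import re

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

# A closing paren only ends a record when a comma or the end of the string follows it.
p = re.compile(r'\((.*?)\)(?:,|$)')
records = p.findall(S)
for record in records:
    print(record)

# Hypothetical input: the quoted field itself contains "),", so the first record is cut short.
tricky = "(1,'a),b',2),(3,'c',4)"
tricky_records = p.findall(tricky)
print(tricky_records)
```

If quoted fields can contain `),`, a pattern that explicitly skips over quoted blocks is probably the safer choice.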


hwnd
1
for element in S[1:-1].split('),('):
    print(element)
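A self-contained sketch of the same idea, assuming the records are always separated by exactly `),(` and that this sequence never appears inside a quoted field. The `loose` input and the `re.split` variant are illustrative additions for the case where whitespace surrounds the separator.

```python
import re

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

# Drop the outermost parens, then split on the "),(" separator between records.
records = S[1:-1].split('),(')
for record in records:
    print(record)

# Hypothetical input with spaces around the separator: split on a regex instead.
loose = "(1,'a_(b)',2) , (3,'c',4)"
loose_records = re.split(r"\)\s*,\s*\(", loose.strip()[1:-1])
print(loose_records)
```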
mmachine
1

You could use the regex below.

>>> import re
>>> s = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"
>>> for i in re.findall(r"\(((?:'[^']*'|[^()])*)\)", s):
...     print(i)
...
45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

Explanation:

  • \( - Matches a literal ( symbol.
  • ( - Start of the capturing group.
  • (?:'[^']*'|[^()])* - The '[^']*' part greedily matches a single-quoted block. Any ( or ) inside the block is consumed along with it, because [^']* matches any character except ', zero or more times. If the next character does not start a single-quoted block, the regex falls through to the alternative after the | symbol, [^()], which matches any single character except ( or ). So the whole group (?:'[^']*'|[^()])* matches a single-quoted block or any character other than ( or ), zero or more times.
  • ) - End of the capturing group.
  • \) - Matches a literal ) symbol.
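The breakdown above can also be written into the pattern itself with `re.VERBOSE`, which ignores whitespace and allows comments inside the regex. This is only a restatement of the same pattern; the names `RECORD` and `records` are illustrative.

```python
import re

S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

RECORD = re.compile(r"""
    \(              # literal opening paren of a record
    (               # start of capturing group 1
      (?:
          '[^']*'   # a whole single-quoted block, parens included
        | [^()]     # or any single character that is not a paren
      )*            # repeated zero or more times
    )               # end of capturing group 1
    \)              # literal closing paren of the record
""", re.VERBOSE)

records = RECORD.findall(S)
for record in records:
    print(record)
```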


Avinash Raj
0

A simple approach is a negative lookahead: check that no quote follows the closing parenthesis, e.g.

import re
S = "(45171924,-1,'AbuseFilter/658',2600),(43795362,-1,'!!_(disambiguation)',2600),(45795362,-1,'!!_(disambiguation)',2699)"

m = re.findall(r'\((.*?)\)(?![\'])', S)
for element in m:
    print(element)

prints

45171924,-1,'AbuseFilter/658',2600
43795362,-1,'!!_(disambiguation)',2600
45795362,-1,'!!_(disambiguation)',2699

http://www.codeskulptor.org/#user39_CL89xhroV0_0.py

I put the quote in a character class (square brackets) so that you can add other characters that, when they follow a closing parenthesis, should cause that parenthesis to be ignored.
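For instance, here is a sketch that adds the double quote to the character class. The inputs `S2` and `cut_input` are hypothetical and not taken from the question. The second one shows the limit of this approach: an inner `)` that is not immediately followed by a quote still ends the match.

```python
import re

# Lookahead rejects a closing paren that is followed by a single or double quote.
pattern = re.compile(r'\((.*?)\)(?![\'"])')

# Hypothetical input mixing single- and double-quoted fields.
S2 = "(1,'a_(b)',2),(3,\"c_(d)\",4)"
records = pattern.findall(S2)
print(records)

# Hypothetical input: the inner ")" is followed by "c", not a quote, so the record is cut short.
cut_input = "(1,'a(b)c',2)"
cut = pattern.findall(cut_input)
print(cut)
```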

Zlatin Zlatev