1

How to exclude commas that are in small (less than 20 characters) parentheses?

Get index of this comma, but (not this , comma). Get other commas like, or ,or, 1,1 2 ,2. (not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)

Expected output for this example (the indices of all commas that should be matched): [23, 71, 76, 79, 82, 87, 132]


Neret
  • Could you please clarify your question? Specifically, could you provide an input sample with your expected output? – Blake G Dec 13 '20 at 16:54
  • The syntax you're requesting is not, formally, a regular language -- that's why all the answers use extensions like backreferences that aren't part of "real" regular expression languages. There's a great deal of academic literature about what a true regex can and can't match -- see https://en.wikipedia.org/wiki/Regular_language for a high-level intro -- and when you get into extensions, the classic (very fast) algorithms may no longer work -- meaning matching can involve backtracking and thus get slower or more memory-intensive. See also https://swtch.com/~rsc/regexp/regexp1.html – Charles Duffy Dec 13 '20 at 22:10

4 Answers

2

Regex pattern: (,)|(\([^()]{0,20}\))

Intuition behind this pattern:

  • (,) looks for all commas. These are stored in capturing group 1.

  • (\([^()]{0,20}\)) looks for all parentheses with at most 20 characters in between. These are stored in capturing group 2.

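Because a short parenthesized group is consumed as a whole by group 2, the commas inside it are never matched by group 1. We can then keep only the group 1 matches, which excludes commas within parentheses that contain at most 20 characters.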

Now to find the indices for these matches, use re.finditer() combined with Match.start() and Match.group() to find the starting index for each match from group 1:

import re

string = """Get index of this comma, but (not this , comma). Get other commas like , or ,or, 1,1 2 ,2.
(not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"""

indices = [m.start(1) for m in re.finditer(r'(,)|(\([^()]{0,20}\))', string) if m.group(1)]

print(indices)
# > [23, 71, 76, 79, 82, 87, 132]
print([string[index] for index in indices])
# > [',', ',', ',', ',', ',', ',', ',']

m.start(1) returns the starting index of the group 1 match. re.finditer() yields a match object for every match of the whole pattern, whichever alternative matched, so adding if m.group(1) keeps only the matches in which group 1 took part (m.group(1) is None when only group 2 matched).
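
To make the role of the two groups concrete, here is a small sketch on a shorter input. The sample string and the variable names are made up for illustration; the pattern is the one from this answer.

```python
import re

pattern = re.compile(r'(,)|(\([^()]{0,20}\))')
sample = "a, b (c, d) e, f"  # hypothetical input, not the question's text

# Every match of the whole pattern is yielded; group 1 is set only for bare commas.
for m in pattern.finditer(sample):
    kind = "comma" if m.group(1) else "short parentheses"
    print(m.start(), repr(m.group(0)), kind)
# > 1 ',' comma
# > 5 '(c, d)' short parentheses
# > 13 ',' comma

# Filtering on group 1 drops the parenthesized match, and with it the comma at index 7.
kept = [m.start(1) for m in pattern.finditer(sample) if m.group(1)]
print(kept)
# > [1, 13]
```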

Edit: This ignores parentheses with 20 or fewer characters inside, which is not consistent with your first statement but is consistent with what the example explains. If you want fewer than 20, just use {0,19}.

Wiktor Stribiżew
brianrice
  • Please always use the minimum threshold value inside limiting quantifiers, it is best practice and will help avoid issues if a regex is used with a regex library that is not able to parse such quantifiers with missing min value. – Wiktor Stribiżew Dec 13 '20 at 20:50
2

You could also use the PyPI regex module with (*SKIP)(*FAIL) to match and then exclude the characters that you don't want in the match result.

In this case, you can match 1 to 20 characters between parentheses, where commas should not be matched.

\([^()]{1,20}\)(*SKIP)(*FAIL)|,

Explanation

  • \( Match (
  • [^()]{1,20} Match 1-20 times any char except ( or )
  • \) Match )
  • (*SKIP)(*FAIL) Exclude the characters from the match result
  • | Or
  • , Match a comma


Example code

import regex

s = """Get index of this comma, but (not this , comma). Get other commas like , or ,or, 1,1 2 ,2.
(not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"""
pattern = r"\([^()]{1,20}\)(*SKIP)(*FAIL)|,"
indices = [m.start(0) for m in regex.finditer(pattern, s)]
print(indices)

Output

[23, 71, 76, 79, 82, 87, 132]
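
If installing the regex module is not an option, the same skip-then-match idea can be approximated with the standard library by masking. This is a different technique from (*SKIP)(*FAIL), shown only as a rough sketch: each short parenthesized group is overwritten with spaces of the same length, so the indices of the remaining commas are unchanged. The function name comma_indices_masked, the max_len parameter and the sample string are made up for illustration.

```python
import re

def comma_indices_masked(text, max_len=20):
    """Return comma indices, ignoring commas inside short parentheses."""
    # Same shape as the pattern above: 1..max_len non-parenthesis characters.
    short_group = re.compile(r'\([^()]{1,%d}\)' % max_len)
    # Overwrite each short group with spaces of equal length to preserve indices.
    masked = short_group.sub(lambda m: ' ' * len(m.group(0)), text)
    return [i for i, ch in enumerate(masked) if ch == ',']

sample = "x, (a, b) y, (this one is long enough, to be kept whole)"
print(comma_indices_masked(sample))
# > [1, 11, 37]
```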
The fourth bird
1

Use PyPI regex:

,(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),


Python code:

import regex
text = r"Get index of this comma, but (not this , comma). Get other commas like, or ,or, 1,1 2 ,2. (not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"
reg_expression = r',(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),'
print(regex.sub(reg_expression, r'<COMMA>\g<0></COMMA>', text))
# Get index of this comma<COMMA>,</COMMA> but (not this , comma). Get other commas like<COMMA>,</COMMA> or <COMMA>,</COMMA>or<COMMA>,</COMMA> 1<COMMA>,</COMMA>1 2 <COMMA>,</COMMA>2. (not this ,) BUT (get index of this comma<COMMA>,</COMMA> if more than 20 characters are inside the parentheses)
indices = [x.start() for x in regex.finditer(reg_expression, text)]
print(indices)
# [23, 70, 75, 78, 81, 86, 131]

Expression explanation

--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \)                       ')'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
      [^()]{20}                any character except: '(', ')' (20
                               times)
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  ,                        ','
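
One reason this pattern needs the PyPI regex module is that the look-behind contains [^()]*, which can match a variable number of characters, and the standard library's re module accepts only fixed-width look-behind. A small check using only the standard library (the flag name compiled_with_re is made up for illustration):

```python
import re

reg_expression = r',(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),'

try:
    re.compile(reg_expression)
    compiled_with_re = True
except re.error as exc:
    # The stdlib engine refuses the variable-width look-behind at compile time.
    compiled_with_re = False
    print("re rejected the pattern:", exc)
```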
Ryszard Czech
0

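You could just use a for loop with some if statements. This is not ideal code, but it gets you the answer for this example. Note that it treats any comma fewer than 20 characters past the most recent unclosed "(" as being inside a short parenthesis, so a comma near the start of a long parenthesis is skipped too. Here is an example: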

textString = 'Get index of this comma, but (not this , comma). Get other commas like , or ,or, 1,1 2 ,2.(not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)'
parFlag = False #flag to check ()
commas = []
lastPar = 0 #last seen ()
for i in range(len(textString)):
    if(textString[i]=='('):
        parFlag = True
        lastPar = i
    if(textString[i]==')' or i-lastPar>=20):
        parFlag = False
    if( textString[i] == ',' and not parFlag):
        commas.append(i)
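
A possible variant of the loop, as a sketch only: it looks ahead to the closing parenthesis and measures the group before deciding whether to skip it, so a comma near the start of a long group is still reported. The function name comma_indices and the max_len parameter are made up for illustration. As in the regex answers, a group is skipped only if it contains no other parenthesis.

```python
def comma_indices(text, max_len=20):
    """Return comma indices, skipping commas inside parentheses of at most max_len characters."""
    indices = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '(':
            close = text.find(')', i + 1)
            next_open = text.find('(', i + 1)
            is_flat = next_open == -1 or next_open > close
            # Skip the whole group when it is closed, short, and not nested.
            if close != -1 and is_flat and close - i - 1 <= max_len:
                i = close + 1
                continue
        elif ch == ',':
            indices.append(i)
        i += 1
    return indices

text = ("Get index of this comma, but (not this , comma). "
        "Get other commas like , or ,or, 1,1 2 ,2.\n"
        "(not this ,) BUT (get index of this comma, if more than 20 "
        "characters are inside the parentheses)")
print(comma_indices(text))
# > [23, 71, 76, 79, 82, 87, 132]
```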

vegiv