string: abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc

regex: re.compile(r'(?:keyword1)(.*)(?:keyword2)', flags = re.DOTALL | re.MULTILINE)

goal: exclude all digits except the ones within brackets from match

desired output: 'ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd'

approach1: Any digits within brackets are always 99, but digits outside of brackets can also be 99. So I could exclude every digit except 99 from the match, and afterwards use something other than regex to remove the remaining 99s outside of brackets?!

approach2: Match ddd (basically everything, including the 99s) except all other digits, using some variant of the help below. I played around with (\([^)]*\)|\S)* but failed, probably because it's Java :D

Question: Which approach makes sense? How can I modify my regex to reach my goal?

Related help: "Exclude strings within parentheses from a regular expression?" suggests (\([^)]*\)|\S)*, where one balanced set of parentheses is treated as if it were a single character, so the regex as a whole matches a single word, and a word can contain these parenthesized groups.

id345678
  • Do you mean to get any digit chunks inside parentheses? As with `\d+(?=[^()]*\))` (assuming all parentheses are paired in your strings)? – Wiktor Stribiżew Apr 11 '22 at 09:03
  • Yes, i want to exclude any digits (represented by the `1`s) but include the `99/` chunk in my match as it would be lost if i simply remove any digits from the string – id345678 Apr 11 '22 at 09:04
  • Then, do you mean you want `re.findall(r'\b99/\d+(?=[^()]*\))', text)`? – Wiktor Stribiżew Apr 11 '22 at 09:06
  • Maybe you can include what exactly would be your desired result? Also, you are matching between keywords, does that mean that 'ddd' is a keyword and you are looking for digits between those? – JvdV Apr 11 '22 at 09:08
  • See https://ideone.com/xPuJY2 – Wiktor Stribiżew Apr 11 '22 at 09:28

1 Answer

Without any additional packages, you can use a two-step approach: get the string between the keywords, and then remove all digit chunks that are not inside parentheses:

import re
s = "abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc"
m = re.search(r'keyword1(.*?)keyword2', s, re.I | re.S)
if m:
    print( re.sub(r'(\([^()]*\))|\s*\d+', r'\1', m.group(1)) )

## => ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd

See the Python demo.

Notes:

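  • keyword1(.*?)keyword2 extracts all contents between keyword1 and keyword2 into Group 1 (.*? is lazy, so the match stops at the first keyword2)
  • re.sub(r'(\([^()]*\))|\s*\d+', r'\1', m.group(1)) removes any digit chunks preceded by optional whitespace from the Group 1 value while keeping all strings between ( and ) intact: a parenthesized substring is captured into Group 1 of this second pattern and put back via \1, whereas a digit chunk matched by the other alternative leaves that group empty and is replaced with nothing.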
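
The two steps can also be wrapped into a small helper, as a minimal sketch. The patterns are the ones shown above; the function name strip_outer_digits, the constant names, and the final .strip() (used to drop the spaces left next to the keywords so the result matches the desired output exactly) are assumptions added for illustration. Like the original, it assumes parentheses are paired and not nested.

```python
import re

# Patterns taken from the answer; the names below are hypothetical.
BETWEEN_KEYWORDS = re.compile(r'keyword1(.*?)keyword2', re.I | re.S)
PARENS_OR_DIGITS = re.compile(r'(\([^()]*\))|\s*\d+')


def strip_outer_digits(text):
    """Return the text between the keywords with digit chunks outside
    parentheses removed, or None if the keywords are not found."""
    m = BETWEEN_KEYWORDS.search(text)
    if m is None:
        return None
    # Keep a parenthesized group as-is; replace a digit chunk with nothing.
    cleaned = PARENS_OR_DIGITS.sub(lambda mo: mo.group(1) or '', m.group(1))
    # Assumption: surrounding whitespace is not wanted in the result.
    return cleaned.strip()


s = "abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc"
print(strip_outer_digits(s))
# => ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd
```

The callback form (mo.group(1) or '') behaves the same as r'\1' on current Python versions, but it also avoids the error that r'\1' raised for an unmatched group before Python 3.5.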
Wiktor Stribiżew