
I want to extract and locate the words within all brackets/braces in a sentence, but I am currently having trouble with overlapping brackets. e.g.:

[in]: import re
[in]: sentence = '{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb'
[in]: pattern = r"\[([^\[\]()]+?)\]|\(([^\[\]()]+?)\)|\{([^\[\]()]+?)\}"
[in]: [(m.start(0), m.end(0), sentence[m.start(0) : m.end(0)]) for m in re.finditer(pattern, sentence)]
[out]: [(0, 4, '{ia}'), (5, 27, '({fascia} antebrachii)')]

It should identify three matches with the correct indices, but the nested '{fascia}' at (6, 14) is missing. Any advice?

lemon
Blue482

1 Answer


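Try the third-party regex module (pip install regex). Unlike the standard re module, its finditer accepts overlapped=True, so a match can start inside a previous match: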

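import regex as re  # third-party module, not the standard library's re

sentence = '{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb'
# Raw string so that \[ and \( reach the regex engine as written
pattern = r'{[^{}]+}|\[[^\[\]]+\]|\([^\(\)]+\)'

# overlapped=True resumes the search one character after the previous match start
[(m.start(0), m.end(0), sentence[m.start(0) : m.end(0)]) for m in re.finditer(pattern, sentence, overlapped=True)]
# [(0, 4, '{ia}'), (5, 27, '({fascia} antebrachii)'), (6, 14, '{fascia}')]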

The pattern is also simplified. Each alternative only excludes its own kind of delimiter, so it matches:

  • one or more non-brace characters enclosed in braces: {[^{}]+}
  • one or more non-bracket characters enclosed in brackets: \[[^\[\]]+\]
  • one or more non-parenthesis characters enclosed in parentheses: \([^\(\)]+\)
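If installing a third-party module is not an option, the same overlapping behaviour can be approximated with the standard re module. This is only a sketch, not part of the approach above: it wraps the same three alternatives in a capturing group inside a zero-width lookahead, so no characters are consumed and the search can find a match starting inside an earlier one. The name `matches` is just illustrative.

```python
import re

sentence = '{ia} ({fascia} antebrachii). Genom att aponeurosen fäster i armb'

# The lookahead (?=...) matches an empty string, so the scan advances one
# character at a time; the real text is read from capturing group 1.
pattern = re.compile(r"(?=(\{[^{}]+\}|\[[^\[\]]+\]|\([^()]+\)))")

matches = [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(sentence)]
print(matches)
# [(0, 4, '{ia}'), (5, 27, '({fascia} antebrachii)'), (6, 14, '{fascia}')]
```

As with the pattern above, each alternative excludes its own delimiter type, so same-type nesting such as '((a))' would only yield the innermost '(a)'.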
lemon