1

I am struggling with regex. I have read the wiki and played around, but I can't seem to get the right match.

string_before = 'President [Trump] first name is [Donald], so his full name is [[Donald] [Trump]]' 
string_after = 'President [Trump] first name is [Donald], so his full name is [Donald Trump]' 

I want to remove any possible brackets inside the outer brackets while keeping the outer brackets and the text inside.

Could this be solved easily in Python without regex?

Isbister
  • Regex is not that good for dealing with nesting. – khelwood Feb 17 '17 at 09:04
  • Where are you getting text with these brackets to begin with? – Blender Feb 17 '17 at 09:12
  • I have done named entity tagging, and names are tagged with [ ] around them. In this case the tagger believes we have 3 different entities, since Donald is an entity, Trump is an entity, and Donald Trump is another entity. This is a special case where 'Donald' might have been mentioned at the beginning of the text, 'Trump' in the middle, and then the new combination 'Donald Trump' at the end. – Isbister Feb 17 '17 at 09:17

3 Answers

1

Regex will cause you more harm than good for such problems. You will need to write some parsing logic based on grammar or rules.

You could, for example, take a look at Finite-State Transducers (1, 2), which would be a suitable method of parsing nested constructions, but they are more complex than regex to understand and use.
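As a rough illustration of what "parsing logic based on rules" could look like here, the sketch below walks the string once and tracks the bracket nesting depth, keeping a bracket only when it opens or closes the outermost level. This is not a finite-state transducer, just a plain depth counter; the function name `flatten_brackets` is made up for this example, and the choice to keep stray closing brackets unchanged is an assumption.

```python
def flatten_brackets(text: str) -> str:
    """Drop every '[' and ']' that is nested inside another bracket pair."""
    out = []
    depth = 0  # how many '[' are currently open
    for ch in text:
        if ch == "[":
            depth += 1
            if depth == 1:
                out.append(ch)  # outermost opening bracket: keep it
        elif ch == "]":
            if depth <= 1:
                out.append(ch)  # outermost close (or a stray ']'): keep it
            depth = max(depth - 1, 0)
        else:
            out.append(ch)  # ordinary text is always kept
    return "".join(out)


string_before = ("President [Trump] first name is [Donald], "
                 "so his full name is [[Donald] [Trump]]")
print(flatten_brackets(string_before))
```

Because it only counts depth, this handles any number of inner groups and any nesting depth, and it needs no regex at all.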

Community
Matt Fortier
1

In the specific case of two adjacent bracketed expressions inside a pair of brackets, you can do

import re
string = re.sub(r'\[\[([^][]+)\] \[([^][]+)\]\]', r'[\1 \2]', string)

This does not conveniently extend to an arbitrary number of adjacent bracketed expressions, but perhaps it's enough for your needs.
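If an arbitrary number of inner groups is needed, one possible variation (a sketch, not part of the original suggestion) is to match the whole outer group and strip the inner brackets in a replacement callback. It assumes brackets are nested at most one level deep; the names `OUTER` and `strip_inner` are made up for this example.

```python
import re

# An outer '[' ... ']' whose body is plain text and/or non-nested [..] groups.
OUTER = re.compile(r'\[((?:[^\[\]]|\[[^\[\]]*\])*)\]')


def strip_inner(text: str) -> str:
    """Remove brackets found inside each outer bracket pair (one level deep)."""
    def repl(match):
        # Delete every bracket in the body, then restore the outer pair.
        return '[' + re.sub(r'[\[\]]', '', match.group(1)) + ']'
    return OUTER.sub(repl, text)


print(strip_inner('blablabla [[Donald] [Trump]] blablabla'))
print(strip_inner('x [[A] [B] [C]] y [D]'))
```

Groups without inner brackets, such as `[D]`, pass through unchanged, and the substitution applies to every outer group in the string, not just one.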

tripleee
0
In [1]: import re
In [2]: before='blablabla [[Donald] [Trump]] blablabla'
In [3]: l=before.find('[')+1
In [4]: r=before.rfind(']')
In [5]: before[:l] + re.sub( r'[][]','',before[l:r]) + before[r:]
Out[5]: 'blablabla [Donald Trump] blablabla'

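This just shows one way to go; error checking and handling are omitted. Note that it uses the first '[' and the last ']' in the whole string, so it only works when the string contains a single outer bracket pair.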

Kent
  • Cool, yeah, it solves that specific case. I did not elaborate on my examples enough, since they can look like: "I think [Donald] is the first name of the president [Trump] but some people call him [[Donald] [Trump]] so he shall be called [[Donald] [Trump]]". I will update my question. – Isbister Feb 17 '17 at 10:11