9

I'm trying to grab any text outside of brackets with a regex.

Example string

Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]

I'm able to get the text inside the square brackets successfully with:

addrs = re.findall(r"\[(.*?)\]", example_str)
print addrs
[u'3996 COLLEGE AVENUE, SOMETOWN, MD 21003',u'2560 OAK ST, GLENMEADE, WI 14098']    

but I'm having trouble getting anything outside of the square brackets. I've tried something like the following:

names = re.findall(r"(.*?)\[.*\]+", example_str)

but that only finds the first name:

print names
[u'Josie Smith ']

So far I've only seen a string containing one to two name [address] combos, but I'm assuming there could be any number of them in a string.

Banjer

4 Answers

12

If there are no nested brackets, you can just do this:

re.findall(r'(.*?)\[.*?\]', example_str)

However, you don't even really need a regex here. Just split on brackets:

(s.split(']')[-1] for s in example_str.split('['))
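As a rough sketch of what that generator produces on the question's string (the empty-string filtering and the `strip()` call are my additions, not part of the answer above):

```python
example_str = (
    "Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
    "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]"
)

# Split on '[' and keep whatever follows the last ']' in each piece.
pieces = [s.split(']')[-1] for s in example_str.split('[')]
print(pieces)  # ['Josie Smith ', 'Mugsy Dog Smith ', '']

# The last piece is the (empty) text after the final ']'.
# Dropping empties and trimming whitespace is an optional extra step.
names = [p.strip() for p in pieces if p.strip()]
print(names)  # ['Josie Smith', 'Mugsy Dog Smith']
```

Note that the raw result keeps the trailing space of each name and ends with an empty string, so some cleanup is probably wanted either way.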

The only reason your attempt didn't work:

re.findall(r"(.*?)\[.*\]+", example_str)

… is that you were doing a greedy match within the brackets (.* rather than .*?), which means it was capturing everything from the first open bracket to the last close bracket, instead of capturing just the first pair of brackets.
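To see the difference side by side, here is a small comparison of the question's pattern, which uses a greedy `.*` inside the brackets, with the lazy `.*?` version. The variable names are only for illustration:

```python
import re

example_str = (
    "Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
    "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]"
)

# Greedy .* runs from the first '[' all the way to the last ']',
# so the whole string is consumed by a single match.
greedy = re.findall(r"(.*?)\[.*\]+", example_str)
print(greedy)  # ['Josie Smith ']

# Lazy .*? stops at the first ']' it can, giving one match per pair.
lazy = re.findall(r"(.*?)\[.*?\]", example_str)
print(lazy)  # ['Josie Smith ', 'Mugsy Dog Smith ']
```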


Also, the + on the end seems wrong. If you had 'abc [def][ghi] jkl[mno]', would you want to get back ['abc ', '', ' jkl'], or ['abc ', ' jkl']? If the former, don't add the +. If it's the latter, do, but then you need to put the whole bracketed pattern in a non-capturing group: r'(.*?)(?:\[.*?\])+'.
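A quick check of those variants on the 'abc [def][ghi] jkl[mno]' string; the variable names here are just for illustration:

```python
import re

s = 'abc [def][ghi] jkl[mno]'

# No '+': every bracket pair is its own match, so the empty text
# between '[def]' and '[ghi]' shows up as ''.
no_plus = re.findall(r'(.*?)\[.*?\]', s)
print(no_plus)  # ['abc ', '', ' jkl']

# '+' applied to a non-capturing group: adjacent pairs are consumed together.
grouped_plus = re.findall(r'(.*?)(?:\[.*?\])+', s)
print(grouped_plus)  # ['abc ', ' jkl']

# A bare '\]+' only repeats the closing bracket, so nothing changes here.
bare_plus = re.findall(r'(.*?)\[.*?\]+', s)
print(bare_plus)  # ['abc ', '', ' jkl']
```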


If there might be additional text after the last bracket, the split method will work fine, or you could use re.split instead of re.findall… but if you want to adjust your original regex to work with that, you can.
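For the re.split route, a minimal sketch might look like this. The shortened addresses and the trailing text in the sample string are made up for the demonstration:

```python
import re

# Hypothetical input with extra text after the last bracket pair.
s = "Josie Smith [addr one]Mugsy Dog Smith [addr two] trailing text"

# Split on each (non-nested) bracketed chunk; what is left is the outside text.
parts = re.split(r"\[.*?\]", s)
print(parts)  # ['Josie Smith ', 'Mugsy Dog Smith ', ' trailing text']

# When the string ends with ']', the last element is '', so filter if needed.
outside = [p for p in parts if p]
```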

In English, what you want is any (non-greedy) substring before a bracket-enclosed substring or the end of the string, right?

So, you need an alternation between \[.*?\] and $. Of course you need to group that in order to write the alternation, and you don't want to capture the group. So:

re.findall(r"(.*?)(?:\[.*?\]|$)", example_str)
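A small sketch of that pattern on a string with text after the last bracket. The sample string is invented, and filtering out empty strings is my own addition, since the `$` alternative can also produce an empty match at the very end of the string:

```python
import re

# Hypothetical input with text after the final bracket pair.
s = "Josie Smith [addr one]Mugsy Dog Smith [addr two] trailing text"

raw = re.findall(r"(.*?)(?:\[.*?\]|$)", s)

# Drop any empty captures (for example, the zero-width match at the end).
outside = [m for m in raw if m]
print(outside)  # ['Josie Smith ', 'Mugsy Dog Smith ', ' trailing text']
```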
abarnert
  • What if there's any text *after* the last pair of brackets? (referring only to your regex; your splitting solution works) – Tim Pietzcker Jun 24 '13 at 21:06
  • Ah yes, that all makes sense. I like the `split` solution better myself. – Banjer Jun 24 '13 at 21:10
  • @TimPietzcker: You can add that in the same style as the OP's original regex; the only bit that is at all complicated is that the obvious way to write it needs a non-capturing group. Edited the answer to show how. – abarnert Jun 24 '13 at 21:14
  • The `\]+` is wrong anyhow, because it repeats only the bracket, not the bracketed text. – Janne Karila Jun 25 '13 at 06:32
  • Good point. As I said, I don't think the OP wanted it anyway, but I'll edit the answer. – abarnert Jun 25 '13 at 17:49
5

If there are never nested brackets:

([^[\]]+)(?:$|\[)

Example:

>>> import re
>>> s = 'Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]'
>>> re.findall(r'([^[\]]+)(?:$|\[)', s)
['Josie Smith ', 'Mugsy Dog Smith ']

Explanation:

([^[\]]+)   # match one or more characters that are not '[' or ']' and place in group 1
(?:$|\[)    # match either a '[' or at the end of the string, do not capture
Andrew Clark
  • This one works better because it doesn't return an empty string like @abarnert's does – Capt. Crunch Mar 08 '17 at 06:14
  • This captures the ending **[**, doesn't it? Or at least that's what I get: https://regex101.com/r/G43H7X/1 I think that a slightly better version would be `([^[\]]+)(?:$|(?=\[))` so that only the content is extracted – adelriosantiago Feb 25 '19 at 23:18
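To illustrate the point in the last comment: with the original pattern the overall match (group 0) does include the trailing '[', but group 1, which is what re.findall returns, does not. A sketch comparing both patterns with re.finditer (variable names are illustrative):

```python
import re

s = ('Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]'
     'Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]')

# Original pattern: the '[' is consumed by the match but not captured.
consuming = [(m.group(0), m.group(1))
             for m in re.finditer(r'([^[\]]+)(?:$|\[)', s)]
print(consuming)
# [('Josie Smith [', 'Josie Smith '), ('Mugsy Dog Smith [', 'Mugsy Dog Smith ')]

# Lookahead variant from the comment: the '[' is only peeked at.
peeking = [(m.group(0), m.group(1))
           for m in re.finditer(r'([^[\]]+)(?:$|(?=\[))', s)]
print(peeking)
# [('Josie Smith ', 'Josie Smith '), ('Mugsy Dog Smith ', 'Mugsy Dog Smith ')]
```

So for re.findall the two give the same list; the lookahead only matters if you use the full match, as a regex tester does.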
3

If you want to go with regex and still handle nested brackets, you can go with:

import re
# Raw string, so that \] and \[ are not treated as (invalid) string escapes.
expr = re.compile(r"(?:^|])([^[\]]+)(?:\[|$)")

print(expr.findall("myexpr[skip this[and this]]another[and skip that too]"))

This will yield ['myexpr', 'another'].

The idea is to match anything between the beginning of the string or a ] and the end of the string or a [.
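If the nesting can get more involved, a plain depth counter is an alternative to the regex. This is not the approach from the answer above, just a sketch of a different technique; `text_outside_brackets` is a hypothetical helper name, and the second test string is made up to show a case where text between a ] and a [ sits inside an outer pair:

```python
import re

def text_outside_brackets(s):
    """Collect runs of text found at bracket depth 0 (hypothetical helper)."""
    parts, current, depth = [], [], 0
    for ch in s:
        if ch == '[':
            # Entering a bracket from the top level ends the current run.
            if depth == 0 and current:
                parts.append(''.join(current))
                current = []
            depth += 1
        elif ch == ']':
            # Ignore unmatched closing brackets rather than going negative.
            depth = max(depth - 1, 0)
        elif depth == 0:
            current.append(ch)
    if current:
        parts.append(''.join(current))
    return parts

simple = "myexpr[skip this[and this]]another[and skip that too]"
print(text_outside_brackets(simple))  # ['myexpr', 'another']

# Made-up input: 'd' is between ']' and '[' but still inside the outer pair.
tricky = "x[a][b[c]d[e]]y"
print(text_outside_brackets(tricky))  # ['x', 'y']
print(re.findall(r"(?:^|])([^[\]]+)(?:\[|$)", tricky))  # ['x', 'd', 'y']
```

The counter assumes brackets are reasonably balanced; how to treat unbalanced input is a design choice this sketch makes arbitrarily.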

Jean-François Fabre
Faibbus
2

you can do this:

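outside = re.findall(r"[^][]+(?=\[[^]]*]|$)", example_str)

In other words: any run of characters that are not square brackets, followed either by something inside square brackets or by the end of the string. (Excluding only the opening bracket, as in [^[]+, would also let the address text and its closing bracket match.)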

Casimir et Hippolyte