Search text with a regular expression to match outside specific characters

Question

I have text that looks like:

My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)

The goal is to use a regular expression to find all parenthesized names that occur anywhere in the text except between square brackets.

So in the text above, the result I am looking for is:

Richard
Robert
Jill

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jonas Elfström, Mar 18 '10 at 16:45

score 3 · Answer 1 · answered Mar 18 '10 at 16:52


You can do it in two steps:

Step 1: match all bracketed content using:

\[[^\]]*\]

and replace it with the empty string ('').

Step 2: match all the remaining parenthesized names (globally) using:

\([^)]*\)
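
A minimal Python sketch of these two steps, assuming the square brackets are balanced and not nested. The function name `names_outside_brackets` is made up for illustration, and a capture group has been added to the step 2 pattern so that only the name comes back, without its parentheses.

```python
import re

TEXT = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
        "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

def names_outside_brackets(text):
    # Step 1: delete every [...] span, including whatever it contains.
    without_brackets = re.sub(r'\[[^\]]*\]', '', text)
    # Step 2: collect the contents of each remaining (...) pair.
    return re.findall(r'\(([^)]*)\)', without_brackets)

print(names_outside_brackets(TEXT))  # ['Richard', 'Robert', 'Jill']
```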

answered Mar 18 '10 at 16:52

codaddict



Yes, you can, but that wouldn't be that much fun would it? – user187291 Mar 18 '10 at 17:23

score 2 · Accepted Answer · answered Mar 18 '10 at 17:20

You didn't say what language you're using, so here's some Python:

>>> import re
>>> REGEX = re.compile(r'(?:[^\[(]+|\(([^)]*)\)|\[[^\]]*\])')
>>> s = """My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))

The output is:

['Richard', 'Robert', 'Jill']

One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)

The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
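
For languages without a findall equivalent, the "loop to repeatedly match" could look roughly like the sketch below, written in Python with `re.finditer` for comparison. The names `SAMPLE` and `names_by_loop` are illustrative only. Note that inside the loop an unmatched group is `None` rather than an empty string.

```python
import re

REGEX = re.compile(r'(?:[^\[(]+|\(([^)]*)\)|\[[^\]]*\])')

SAMPLE = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
          "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

def names_by_loop(text):
    names = []
    # Walk the matches in order; each one is plain text, a (...) chunk,
    # or a [...] chunk.
    for match in REGEX.finditer(text):
        # group(1) is None unless the parenthesized alternative matched.
        if match.group(1):
            names.append(match.group(1))
    return names

print(names_by_loop(SAMPLE))  # ['Richard', 'Robert', 'Jill']
```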

score 1 · Answer 3 · answered Mar 18 '10 at 16:53


If you are using .NET you can do something like:

"(?<!\[.*?)(?<name>\(\w+\))(?>!.*\])"

answered Mar 18 '10 at 16:53

Paulo Santos


Won't this fail to pick Robert out of the example? The lookbehind will find the `[` that contains Jack, and the lookahead will find Betty's `]`. The `.`s would need to be replaced with `[^\]]` and `[^\[]` respectively, I guess. Some regex engines don't support non-fixed-width negative lookbehinds, either. – Chris Mar 18 '10 at 17:01
It's a **negative** lookahead and lookbehind – Paulo Santos Mar 18 '10 at 17:05
I'm aware of this. Thinking about it more, I think this will fail to pick any names at all from the input - have you actually tried it? ;) For all names except Richard, the negative lookbehind will cause the match to fail (as `\[.*?` can trivially be matched ending at the start of all the other names), and for all except Jill the negative lookahead will cause it to fail for similar reasons. – Chris Mar 18 '10 at 17:13
@Chris is right: it doesn't work as-is, and after making the changes he suggested it will only work in .NET or JGSoft (EditPad Pro, PowerGrep, etc.), because they're the only flavors that support unbounded lookbehind. Also, you've got the negative-lookahead syntax wrong. :-/ – Alan Moore Mar 18 '10 at 23:12
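
To illustrate the point about lookbehind support, here is a small Python sketch. Python's built-in `re` module only accepts fixed-width lookbehind, so a pattern in the spirit of Chris's suggestion is rejected at compile time. The lookahead-only pattern `OUTSIDE` is not taken from this thread; it is one possible alternative and assumes balanced, non-nested square brackets.

```python
import re

TEXT = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
        "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

# A variable-width lookbehind fails to compile in Python's re module.
try:
    re.compile(r'(?<!\[[^\]]*)\(\w+\)')
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Lookahead-only alternative (an assumption, not from the thread):
# accept a name only if the text after it does not reach a "]"
# before passing another "[" or "]".
OUTSIDE = re.compile(r'\((\w+)\)(?![^\[\]]*\])')

print(lookbehind_supported)   # False
print(OUTSIDE.findall(TEXT))  # ['Richard', 'Robert', 'Jill']
```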

score 0 · Answer 4 · answered Mar 18 '10 at 16:52

It's not really the best job for a single regexp. Have you considered, for example, making a copy of the string and then deleting everything between the square brackets? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square-bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.

Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).

A quick (PHP) test case:

preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);

print(implode(", ", $m[1]));

Outputs:

Richard, Robert, Jill
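
As a rough illustration of the "very basic parser" idea mentioned above, the following Python sketch scans the text once and tracks square-bracket depth with a counter, so nested brackets are handled too. The function name `scan_names` and the handling of stray or interleaved delimiters are assumptions made for this example.

```python
def scan_names(text):
    names = []
    depth = 0      # current nesting depth of square brackets
    start = None   # index just past an open "(", or None if none is open
    for i, ch in enumerate(text):
        if ch == '[':
            depth += 1
            start = None               # a bracket interrupts any open paren
        elif ch == ']':
            depth = max(depth - 1, 0)  # ignore a stray closing bracket
        elif depth == 0:
            if ch == '(':
                start = i + 1
            elif ch == ')' and start is not None:
                names.append(text[start:i])
                start = None
    return names

print(scan_names("(A) [x [y (B)] (C)] (D)"))  # ['A', 'D']
```

Because the depth counter keeps going past the first `]`, `(C)` is correctly treated as bracketed here, which the single-level patterns would not guarantee.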

@Paulo Santos: I don't know if it's that people "forget" about them, or if it's just that most people have a hard time getting negative assertions to work the way they expect, and so would rather just avoid using them. — Laurence Gonsalves, Mar 18 '10 at 17:05
@Paulo: some of us just *wish* we could forget about them. :P Lookbehinds in particular are both much trickier and much less useful than many people expect them to be. — Alan Moore, Mar 18 '10 at 23:29

ghostdog74 · Answer 5 · 2010-03-18T17:23:27.290

0

>>> s = "My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     # keep only the text before the next "[" in this segment
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill

edited Mar 18 '10 at 17:23

answered Mar 18 '10 at 17:09

ghostdog74


Alan Moore · Answer 6 · 2010-03-18T23:32:58.127

So you want the regex to match the name, but not the enclosing parentheses? This should do it:

[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)

As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.

I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
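
For reference, here is a quick Python check of the pattern above against the question's sample text; Python is just one possible flavor, and the names `TEXT` and `PATTERN` are illustrative. Since the pattern has no capturing groups, `findall` returns the full matches, which are the bare names.

```python
import re

TEXT = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
        "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

# A run of non-paren characters that is followed by ")" and then, up to
# the end of the string, only balanced and non-nested [...] pairs.
PATTERN = re.compile(r'[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)')

print(PATTERN.findall(TEXT))  # ['Richard', 'Robert', 'Jill']
```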
