2

With regular expression I want to detect text/string between starting and ending double curly braces and it should detect any inner curly braces along with text.

for example:

{{detect this {{and this as well}} text}} but text does not ends here so it should {{not detect this}}.

I have written this regexp

\{\{[\s\S]+\}\}

but this selects the whole string FROM {{detect this.... TO {{not detect this}}

Note: I am using python re for this

Fabrício Matté
  • 69,329
  • 26
  • 129
  • 166
curious
  • 23
  • 2
  • You need to parse this, you cannot RegEx this... – thefourtheye Jan 05 '14 at 17:22
  • 2
    @thefourtheye It's not a regular language, but I would not be surprised at all if it's possible to recognize with `re`. Most things called "regex" in modern programming are more powerful than finite automata. Whether that's *advisable* is another question entirely though. –  Jan 05 '14 at 17:23
  • 1
    If you have the [regex](https://pypi.python.org/pypi/regex) module from python 3.X, you could use [this](http://regex101.com/r/qW0sI3). – Jerry Jan 05 '14 at 17:24
  • 1
    @delnan: no it's impossible with python regex since it doesn't support recursion. – Casimir et Hippolyte Jan 05 '14 at 17:24
  • @CasimiretHippolyte You don't *need* recursion for *any* computation, it's equivalent to certain combinations of iteration + state. And Python regexes at least have some state in the form of backreferences. To be clear, I do not know if that's enough to parse nesting, but I do know that the obvious counter arguments do not apply. –  Jan 05 '14 at 17:26
  • Any "regular" expression that you would write for this is going to be fragile. Write a simple parser that counts up when it sees `{{` and down when it sees `}}`. – roippi Jan 05 '14 at 17:27
  • related: http://stackoverflow.com/questions/1099178/matching-nested-structures-with-regular-expressions-in-python – Ryan Haining Jan 05 '14 at 17:30
  • @delnan: all that you can do in pure regex without the recursion feature is to create a pattern for a fixed maximum depth level. However, Jerry seems to have found an alternative regex module that supports recursion. – Casimir et Hippolyte Jan 05 '14 at 17:30
  • @CasimiretHippolyte This area is sufficiently complicated that some rigor is useful. Can you *prove* that? If not, I have a hard time believing the claim - it sounds reasonable, but many false statements about formal languages sound reasonable. –  Jan 05 '14 at 17:34
  • @Jerry Yes, I know how to match limited-depth nesting, and I need no convincing that it's possible as it fits into finite automata. What I'd like to be convinced of is the claim: Regular expressions with backreferences but no recursion cannot recognize unlimited nesting. –  Jan 05 '14 at 17:44
  • @delnan Oh! Okay, I can't prove it nor do I have the strength to start trying to prove it ^^; – Jerry Jan 05 '14 at 17:46
  • @delnan to prove it, you need to understand the recursive pattern (or balancing groups in .NET). You'll realise that it's impossible with only backreferences. Explaining it is a bit tiresome and requires a lot of writing. You may want to read this [article](http://www.rexegg.com/regex-recursion.html) and this [answer](http://stackoverflow.com/a/17004406). – HamZa Jan 05 '14 at 18:58
  • @Jerry That regex would fail for unbalanced `{}`. You need to exclude both combinations of `{{` & `}}`. So [here's a pattern](http://regex101.com/r/jD1bB6). – HamZa Jan 05 '14 at 19:02
  • 1
    @HamZa I knew that, it's just that I didn't want to use negative lookaheads and keep it simple ^^; – Jerry Jan 05 '14 at 19:05
  • @HamZa That's all very interesting but matches my previous understanding of recursive patterns exactly. I understand that these extensions make unlimited nesting easy to handle. But I don't see how any of this implies backreferences can't *also* solve this specific problem. As an analogy, it's very nice that birds can fly but that alone doesn't rule out the possibility of airplanes, even if airplanes may be more complicated than birds. Could you elaborate? –  Jan 05 '14 at 19:12
  • @delnan I'll try to convince you: backreferences hold exactly what was matched previously. So `(a|b)\1`, will match either `aa` or `bb`. So what about recursive regexes ? Well `(?R)` is in fact the same as `(?0)`, which means repeat the regular expression in group 0, group 0 is in fact the whole regex. So `(a|b)(?1)` would basically not just match `aa` & `bb`, but it will also match `ab` & `ba`. So calling `(?1)` is way more powerful since it will repeat the regex and not what was matched. When you got nested characters, you want to repeat the regex and not what was matched. – HamZa Jan 05 '14 at 19:21
  • @HamZa Sounds reasonable. A formal proof would feel better, but apparently. But since the theoretical CS guys seem to have [trouble with that](http://cstheory.stackexchange.com/q/1047) too (note the use of "seems" in the discussion of balanced parentheses), I'm afraid I won't get one. –  Jan 05 '14 at 19:23
  • @delnan I can't give you any theoretical CS stuff since I didn't learn regex in that way. Feel free to chat in the [regex chatroom](http://chat.stackoverflow.com/rooms/25767/regex) – HamZa Jan 05 '14 at 19:38
  • @delnan: It's indeed an interesting question, If I were you, I will post a question in computer science – Casimir et Hippolyte Jan 05 '14 at 20:11

2 Answers2

3

Pyparsing allows you to define recursive grammars, but has some builtin helpers for common ones like this. See annotated code sample below:

from pyparsing import nestedExpr, ungroup, originalTextFor

# use nestedExpr to define a default expression with left-right nesting markers
nestedText = ungroup(nestedExpr('{{','}}'))

sample = """{{detect this {{and this as well}} text}} but text does not ends here so it should {{not detect this}}."""

# note how reporting the results as a list keeps the nesting of {{ }}'s
print nestedText.parseString(sample).asList()
# prints ['detect', 'this', ['and', 'this', 'as', 'well'], 'text']

# if you just want the string itself, wrap with 'originalTextFor'
print originalTextFor(nestedText).parseString(sample)[0]
# prints {{detect this {{and this as well}} text}}
PaulMcG
  • 62,419
  • 16
  • 94
  • 130
0

First of all {{[\s\S]+}} is (almost) the same as {{.+}}. Reason: \s contains all spaces and \S contains everything that is not a space. I would generally avoid the upper case character classes in [], it will mostly cause confusion.

Secondly: I think I am on board with thefourtheye, I cannot quickly think of a RegEx to solve your problem.

Malhelo
  • 399
  • 6
  • 16
  • 1
    In many regex flavors, `.` does not match newlines without a special flag (flag which is not available in some languages) so `[\s\S]` as well as `[^]` is often a workaround for that. – Fabrício Matté Jan 05 '14 at 17:31
  • True, but as OP was talking about using python re, there is a flag for dotall, so I would assume it to be "cleaner" to write .+ and activate dotall, for it is more obvious. – Malhelo Jan 05 '14 at 17:33
  • True, most languages have a `dotall` flag which is indeed a cleaner solution, but sometimes (depending on the use case/regular expression) you may want the `.` to not match newlines, or keep the expression portable to another given language. I just commented for future starters that may visit the `regex` tag. – Fabrício Matté Jan 05 '14 at 17:37