3

How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

For example, I have this text:

outer-start some text inner-start text-that-i-want inner-end some more text outer-end

In this case, I want text-that-i-want because it is between inner-start and inner-end, which themselves are between outer-start and outer-end.

If I have

some text inner-start text-that-i-want inner-end some more text outer-end

then I don't want text-that-i-want, because although it is between inner-start and inner-end, there is no outer-start enclosing these strings.

Likewise, if I have

outer-start some text text-that-i-want inner-end some more text outer-end

then again, I don't want text-that-i-want, because there is no enclosing inner-start, although there are enclosing outer-start and outer-end strings.

Assume that outer-start, inner-start, inner-end and outer-end will only ever be used for the purposes of enclosing/delimiting.

I reckon that I can do this by doing a two pass regular expression match, i.e. looking for any data between outer-start and outer-end, and then within that data looking for any text between inner-start and inner-end (if indeed those strings exist), but I would like to know if it can be done in one go.
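For reference, the two-pass idea described above could look roughly like the following Python sketch. The function name `two_pass` is made up for illustration, and the code assumes the delimiters never nest and that the text contains no newlines (otherwise `re.DOTALL` would be needed).

```python
import re

def two_pass(text):
    """Return every text found between inner-start and inner-end,
    but only when that pair sits inside an outer-start ... outer-end block."""
    results = []
    # Pass 1: pull out the contents of each outer-start ... outer-end block.
    for outer in re.findall(r"outer-start(.*?)outer-end", text):
        # Pass 2: within that block, look for inner-start ... inner-end pairs.
        for inner in re.findall(r"inner-start(.*?)inner-end", outer):
            results.append(inner.strip())
    return results

ok = "outer-start some text inner-start text-that-i-want inner-end some more text outer-end"
no_outer_start = "some text inner-start text-that-i-want inner-end some more text outer-end"
no_inner_start = "outer-start some text text-that-i-want inner-end some more text outer-end"

print(two_pass(ok))
print(two_pass(no_outer_start))
print(two_pass(no_inner_start))
```

Only the first of the three sample strings yields a result; the other two are rejected because one of the enclosing delimiters is missing.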

Tola Odejayi
  • Real examples instead of these "outer-start" placeholders are likely to get you a better answer. –  Jan 02 '10 at 07:14

2 Answers

6
/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

You need to use minimal (non-greedy) matching so that the regexp does not run across several delimited blocks when the input contains more than one "text-that-i-want", for example:

"outer-start some text inner-start first-text-that-i-want inner-end some more text outer-end outer-start some text inner-start second-text-that-i-want inner-end some more text outer-end"

Without minimal matching, you'll get the puzzling single match, "second-text-that-i-want", because the first greedy .* runs all the way to the last inner-start.

The .*? means "eat zero or more characters, but only as many as you need to make the rest of the expression match." Without the ?, a regexp engine will eat as many characters as it can as long as the rest of the expression still matches.
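To see the difference concretely, here is a small sketch in Python (the answer's `/.../` syntax is not Python, so this is a translation, not the original code). It runs both the lazy and the greedy form of the pattern against the two-block example string; the variable names are only illustrative.

```python
import re

text = (
    "outer-start some text inner-start first-text-that-i-want inner-end "
    "some more text outer-end "
    "outer-start some text inner-start second-text-that-i-want inner-end "
    "some more text outer-end"
)

# Lazy quantifiers: each .*? stops as early as it can.
lazy = re.compile(r"outer-start.*?inner-start(.*?)inner-end.*?outer-end")
# Greedy quantifiers: each .* grabs as much as it can, then backtracks.
greedy = re.compile(r"outer-start.*inner-start(.*)inner-end.*outer-end")

# findall returns the contents of the single capture group for each match.
lazy_matches = [m.strip() for m in lazy.findall(text)]
greedy_matches = [m.strip() for m in greedy.findall(text)]

print(lazy_matches)    # ['first-text-that-i-want', 'second-text-that-i-want']
print(greedy_matches)  # ['second-text-that-i-want']
```

The greedy version produces one match spanning the whole string, and its capture group holds only the second text, which is the behaviour described above.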

Wayne Conrad
  • As a matter of fact, with greedy matching you'd get "first-text-that-i-want inner-end some more text outer-end outer-start some text inner-start second-text-that-i-want" in the capture group. – Michał Marczyk Jan 02 '10 at 06:59
  • Michal: Nope, the first (and non-grouped) `.*` eats most of the text you quoted. –  Jan 02 '10 at 07:08
  • Ouch... right. My bad, thanks for the correction. In fact, this is good reason to remove my answer and +1 this one. – Michał Marczyk Jan 02 '10 at 07:28
  • 3
    @Wayne: Why don't you edit to include the lazy version (.*?) in the pattern at the top? As your answer stands, you've got a good explanation of why .*? is to be preferred over .*, yet use .* in the high visibility example. :-) – Michał Marczyk Jan 02 '10 at 07:30
  • @Michael: Oh, that was careless of me. I tested both the good and bad regex, but when I posted the answer, I copied and pasted the bad one. Bad programmer, no cookie! Thanks for watching my back. – Wayne Conrad Jan 02 '10 at 15:20
  • Sure, at least I can give myself a tiny cookie for that. You still get the scrumptious one for sound lazy-matching-fu. ;-) – Michał Marczyk Jan 02 '10 at 16:19
  • Wayne, thanks a lot for the answer. I actually have a follow on question - is there a way I can check to ensure that [the text between outer-start and inner-start] does not contain a specific string? i.e. I only want to return a match if this specific string is not found between outer-start and inner-start. I can open a new question if you think it's complicated... – Tola Odejayi Jan 02 '10 at 19:56
  • @Shoko: In general, just replace the `.*?` between outer-start and inner-start with a pattern which will exclude the specific string that you wish to exclude. That might be tricky or not, depending on the string in question. If in doubt, do ask a separate question. – Michał Marczyk Jan 02 '10 at 21:58
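One possible way to do the replacement suggested in the last comment is a negative lookahead applied at every character between outer-start and inner-start. The Python sketch below is only an illustration: the string `forbidden-word` and the helper `wanted` are hypothetical, and it assumes the regex flavour supports lookahead.

```python
import re

FORBIDDEN = "forbidden-word"  # hypothetical string that must not appear

pattern = re.compile(
    r"outer-start"
    # Consume one character at a time, but only where FORBIDDEN does not begin.
    r"(?:(?!" + re.escape(FORBIDDEN) + r").)*?"
    r"inner-start(.*?)inner-end"
    r".*?outer-end"
)

def wanted(text):
    """Return the captured inner text, or None if there is no valid match."""
    m = pattern.search(text)
    return m.group(1).strip() if m else None

clean = "outer-start some text inner-start text-that-i-want inner-end more outer-end"
dirty = "outer-start some forbidden-word text inner-start text-that-i-want inner-end more outer-end"

print(wanted(clean))  # text-that-i-want
print(wanted(dirty))  # None
```

How tricky this gets depends on the string being excluded, as the comment notes; the lookahead form simply avoids having to hand-write a pattern that excludes it.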
3

I imagine you can do something like:


outer-start .*? inner-start (.*?) inner-end .*? outer-end
Ben McCann
  • Looks like Brian beat me to posting this solution. The reason I included question marks was to save you from trouble with a greedy regex. You'll likely want to include them. – Ben McCann Jan 02 '10 at 06:48