
I need help creating the best possible regular expression for this problem.

I have sets of start and end delimiters, and I need to get ALL the substrings (any words) between a start delimiter and its end delimiter.

Assume this table of delimiters:

START | END
CAT | DOG
APPLE | ORANGE
LION | ZEBRA
PANDA | CAT

Sample input:

substring1 CAT substring2 substring3 DOG substring4 substring5 CAT substring6
APPLE substring7 substring 8 ORANGE ORANGE substring9 DOG substring10 PANDA
substring11 CAT substring12 DOG substring13 LION substring10 substring11 ZEBRA substring12
CAT substring13 substring14 APPLE substring15 substring 16 ORANGE

The output must be:

  1. CAT substring2 substring3 DOG
  2. APPLE substring7 substring 8 ORANGE
  3. PANDA substring11 CAT
  4. LION substring10 substring11 ZEBRA
  5. APPLE substring15 substring 16 ORANGE

My regular expression:

 CAT (.)*? DOG | APPLE (.)*? ORANGE | LION (.)*? ZEBRA |  PANDA (.)*? CAT 

I have a problem dealing with strings that contain occurrences of other starting delimiters.

Take for example:

CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG 

I know that this will match CAT (.)*? DOG, but that is wrong because the substring contains one of the other starting delimiters.
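To make the problem concrete, here is a minimal Python sketch (Python is my choice for illustration; the question does not name a language) that runs the CAT/DOG branch of the pattern on the example string. The variable names are hypothetical.

```python
import re

# The CAT/DOG branch of the pattern from the question, unchanged.
naive = re.compile(r"CAT (.)*? DOG")

text = "CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG "

# The lazy quantifier only makes the match as short as possible;
# it does nothing to stop the match from running across APPLE.
matched_text = naive.search(text).group(0)
print(matched_text)
# prints: CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG
```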

I just need a regex that will get all the words between a starting delimiter and its matching end delimiter, provided the substring does not contain any occurrence of other starting delimiters.

Any suggestions? Thanks.

nfinium

2 Answers

The technique that helps us here is called "lookaround".

I updated my answer after clarification from nfinium and feedback from jsobo.

CAT ((?!(APPLE|LION|PANDA)).)*? DOG|APPLE ((?!(CAT|LION|PANDA)).)*? ORANGE|LION ((?!(CAT|APPLE|PANDA)).)*? ZEBRA|PANDA ((?!(APPLE|LION)).)*? CAT

Given the input:

substring1 CAT substring2 substring3 DOG substring4 substring5 CAT substring6 APPLE substring7 substring 8 ORANGE ORANGE substring9 DOG substring10 PANDA substring11 CAT substring12 DOG substring13 LION substring10 substring11 ZEBRA substring12 CAT substring13 substring14 APPLE substring15 substring 16 ORANGE  string CAT dkdkdkdkdk CAT dkdkdk dkdkdk ORANGE dkdkdkdk DOG etc. CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG wordx

It matches:

CAT substring2 substring3 DOG
APPLE substring7 substring 8 ORANGE
PANDA substring11 CAT
LION substring10 substring11 ZEBRA
APPLE substring15 substring 16 ORANGE
CAT dkdkdkdkdk CAT dkdkdk dkdkdk ORANGE dkdkdkdk DOG
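The result can be reproduced with a short Python sketch. Python's `re` module is an assumption here (the thread does not say which regex engine is in use), the pattern is the one above with PANDA in the first lookahead, and the input is the one above with whitespace normalized to single spaces.

```python
import re

# One branch per delimiter pair; each character between the delimiters is
# consumed only if no forbidden start delimiter begins at that position.
pattern = re.compile(
    r"CAT ((?!(APPLE|LION|PANDA)).)*? DOG"
    r"|APPLE ((?!(CAT|LION|PANDA)).)*? ORANGE"
    r"|LION ((?!(CAT|APPLE|PANDA)).)*? ZEBRA"
    r"|PANDA ((?!(APPLE|LION)).)*? CAT"
)

text = (
    "substring1 CAT substring2 substring3 DOG substring4 substring5 CAT substring6 "
    "APPLE substring7 substring 8 ORANGE ORANGE substring9 DOG substring10 PANDA "
    "substring11 CAT substring12 DOG substring13 LION substring10 substring11 ZEBRA substring12 "
    "CAT substring13 substring14 APPLE substring15 substring 16 ORANGE "
    "string CAT dkdkdkdkdk CAT dkdkdk dkdkdk ORANGE dkdkdkdk DOG etc. "
    "CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG wordx"
)

# group(0) is the whole match; the capture groups only hold the last
# character consumed, so they are not useful on their own.
matches = [m.group(0) for m in pattern.finditer(text)]
for found in matches:
    print(found)
```

Note that `.` does not cross line breaks by default, so with the multi-line input from the question the lines would have to be joined first (or the pattern compiled with `re.DOTALL`) for "PANDA substring11 CAT" to be found.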

Specifically, it will not match the following, as indicated by nfinium:

CAT word1 word2 word3 word4 APPLE word5 word6 word7 DOG 

And it also matches, as you pointed out:

CAT dkdkdkdkdk CAT dkdkdk dkdkdk ORANGE dkdkdkdk DOG 

You say that it should match the following:

CAT substring12 DOG

but I don't think it should, since that CAT is the end delimiter of

PANDA substring11 CAT

This regex produces the result nfinium expects.

Note that, as per nfinium's requirements, CAT can be both a starting and an ending delimiter:

CAT | DOG
PANDA | CAT
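Since every branch follows the same shape, the pattern can be generated from the delimiter table instead of being typed by hand. The sketch below is one way to do that in Python; `build_pattern` is a hypothetical helper, and the rule it encodes (forbid every start delimiter except the pair's own start and its own end) is my reading of how the lookaheads above were chosen.

```python
import re

# Delimiter table from the question, as (start, end) pairs.
PAIRS = [("CAT", "DOG"), ("APPLE", "ORANGE"), ("LION", "ZEBRA"), ("PANDA", "CAT")]

def build_pattern(pairs):
    """Return one alternation branch per (start, end) pair (hypothetical helper)."""
    starts = [start for start, _ in pairs]
    branches = []
    for start, end in pairs:
        # A pair's own start and its own end are allowed between the
        # delimiters; every other start delimiter is forbidden.
        forbidden = [s for s in starts if s not in (start, end)]
        lookahead = "|".join(re.escape(s) for s in forbidden)
        branches.append(
            f"{re.escape(start)} ((?!({lookahead})).)*? {re.escape(end)}"
        )
    return "|".join(branches)

generated = build_pattern(PAIRS)
print(generated)
```

Because PANDA ends with CAT, CAT is left out of the PANDA branch's lookahead automatically. The sketch assumes at least two distinct start delimiters; with only one pair the lookahead would be empty and the branch would need special handling.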
buckley
  • This doesn't find... CAT substring12 DOG on the 3rd line... it also doesn't deal with the following string CAT dkdkdkdkdk CAT dkdkdk dkdkdk ORANGE dkdkdkdk DOG etc... but the idea is close. – John Sobolewski May 18 '12 at 11:49
  • @jsobo My answer was not complete. I tried to hint at a possible solution, but I agree there were more challenges to overcome. I read up on the extra requirements and updated my regex. Can you review it? I took your feedback into account as well. – buckley May 18 '12 at 13:29
  • I will maybe get a chance to look at this again later... but it makes your regex more readable IMHO when a single space is expressed as [ ]... so "\d[ ]\d" is more explicit than "\d \d" because with the square brackets you know it is 1 space.. without you have to click in and move your cursor to be sure. – John Sobolewski May 18 '12 at 20:23
  • The character class trick to signal a space is new to me. It makes the regex a bit longer and more cryptic for some. BTW I did drop the OP's spec that a space should be before the opening delimiter and after the closing delimiter since it only adds noise and is not essential to the solution. – buckley May 18 '12 at 21:15
  • Mostly works... it doesn't work on this line... CAT substring2 CAT substring6 substring3 DOG You need to add the opening word to your negative look ahead – John Sobolewski May 21 '12 at 11:38
  • BTW a lot of work has probably gone into answering this question.. you probably should give some upvotes to the folks who pointed you in the direction of the answer... – John Sobolewski May 21 '12 at 11:41
  • @jsobo Thanks for the review. Currently it matches the whole of "CAT substring2 CAT substring6 substring3 DOG". You are assuming it should match the substring "CAT substring6 substring3 DOG", right? It's not clear to me that the OP wants this, but your comment is right if it should. I make the same assumption (longest match) BTW in the last entry of my sample input. Thanks again for being my sounding board :) – buckley May 21 '12 at 13:00
  • requirements said... "if ever the substring does not contain any occurence of other starting delimeters." I guess the word "OTHER" does make it a bit ambiguous... I think the way you interpreted it is correct based on the requirement. However I think that probably is not what he wants... ;-) – John Sobolewski May 21 '12 at 13:39
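The two readings discussed in these comments differ only in whether the pair's own start word appears in the negative lookahead. A small Python sketch (names are hypothetical, reduced to the CAT/DOG branch) shows the difference on the string from the comment:

```python
import re

text = "CAT substring2 CAT substring6 substring3 DOG"

# Reading 1: only OTHER start delimiters are forbidden between CAT and DOG.
without_own_start = re.compile(r"CAT ((?!(APPLE|LION|PANDA)).)*? DOG")

# Reading 2: the opening word itself is forbidden as well, so a second
# CAT restarts the span.
with_own_start = re.compile(r"CAT ((?!(CAT|APPLE|LION|PANDA)).)*? DOG")

outer = without_own_start.search(text).group(0)
inner = with_own_start.search(text).group(0)

print(outer)  # CAT substring2 CAT substring6 substring3 DOG
print(inner)  # CAT substring6 substring3 DOG
```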

I think the key to this is the 2nd output:

 "APPLE substring7 substring 8 ORANGE" 

which is contained in:

 "CAT substring6 APPLE substring7 substring 8 ORANGE ORANGE substring9 DOG"

So basically you have to catch CAT not followed by APPLE | ORANGE | LION | ZEBRA | PANDA | CAT, as those would start another group. This is potentially possible, but writing a regex to do this is akin to trying to parse HTML with a regex.

See: RegEx match open tags except XHTML self-contained tags

It could be done, but the regex is going to be very complicated. This problem is best handled in code...

Here is an example of what I think you want that handles the first two start/end combos.

(CAT(?!.+(?:APPLE|ORANGE|LION|ZEBRA|PANDA|CAT).+DOG).*?DOG)|(APPLE(?!.+(?:APPLE|LION|ZEBRA|PANDA|CAT|DOG).+ORANGE).*?ORANGE)

Just the first group is...

(CAT(?!.+(?:APPLE|ORANGE|LION|ZEBRA|PANDA|CAT).+DOG).*?DOG)

so you can see if this had more combinations it just gets very verbose.
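As a rough sketch of what handling this in code instead of a single regex could look like, the Python function below scans whitespace-separated tokens and keeps track of the most recently opened start delimiter. This is a plain token scanner, not a regex, and not something either answer spells out. `extract_spans` and its parameters are hypothetical names, and it assumes that any new start delimiter (including a repeat of the same one) discards the span opened before it.

```python
# Delimiter table from the question: start -> end.
PAIRS = {"CAT": "DOG", "APPLE": "ORANGE", "LION": "ZEBRA", "PANDA": "CAT"}

def extract_spans(text, pairs=PAIRS):
    """Collect 'START ... END' spans that contain no other start delimiter."""
    spans = []
    open_start = None   # start delimiter of the span currently being built
    buffer = []         # tokens collected since that start delimiter
    for token in text.split():
        if open_start is not None and token == pairs[open_start]:
            # Matching end found: emit the span. The token is consumed as an
            # end delimiter, so a closing CAT does not open a new CAT span.
            spans.append(" ".join(buffer + [token]))
            open_start, buffer = None, []
        elif token in pairs:
            # Any start delimiter abandons the open span and begins a new one.
            open_start, buffer = token, [token]
        elif open_start is not None:
            buffer.append(token)
    return spans

sample = (
    "substring1 CAT substring2 substring3 DOG substring4 substring5 CAT substring6 "
    "APPLE substring7 substring 8 ORANGE ORANGE substring9 DOG substring10 PANDA "
    "substring11 CAT substring12 DOG substring13 LION substring10 substring11 ZEBRA substring12 "
    "CAT substring13 substring14 APPLE substring15 substring 16 ORANGE"
)

result = extract_spans(sample)
for span in result:
    print(span)
```

On the sample input from the question this yields the five expected spans. Comparing whole tokens also sidesteps the partial-word problem (SUPERCAT is not CAT), and adding a pair is a one-line change to the table.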

John Sobolewski
  • My solution is still flawed in that it doesn't handle WORDs... in other words... SUPERCAT item1 item2 AwesomeDOG would get captured. Also, beginning and end of line rule issues need to be addressed. – John Sobolewski May 18 '12 at 11:56
  • Also this will not find every occurrence, just the last ones... so if you have multiple groups of cat/dog on one line it will not find them. – John Sobolewski May 18 '12 at 11:58
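One common way to address the whole-word issue mentioned in the first comment is the `\b` word-boundary assertion. The Python sketch below is only an illustration: it is reduced to a bare CAT/DOG pattern without the lookahead, and the variable names are hypothetical.

```python
import re

text = "SUPERCAT item1 item2 AwesomeDOG"

# Without boundaries the delimiters match inside longer words.
loose = re.compile(r"CAT.*?DOG")

# \b requires a word boundary on both sides of each delimiter.
strict = re.compile(r"\bCAT\b.*?\bDOG\b")

loose_hit = loose.search(text)
strict_hit = strict.search(text)
whole_word_hit = strict.search("a CAT item1 item2 DOG b")

print(loose_hit.group(0))       # CAT item1 item2 AwesomeDOG
print(strict_hit)               # None
print(whole_word_hit.group(0))  # CAT item1 item2 DOG
```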