3

I need to fashion a regex with the following requirements:

Given sample text:

SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3
SEARCH_TERM_#1 find this text SEARCH-TERM_#3

I want to extract the string which appears in the "find this text" area.

The regex should collect data after SEARCH_TERM_#1 up to, but not including, SEARCH-TERM_#2 or SEARCH-TERM_#3, whichever comes first. In other words, the right-hand border of the match should be whichever of #2 and #3 is found first.

I've tried (?>SEARCH_TERM_#2|SEARCH_TERM_#3), (?=(?>SEARCH_TERM_#2|SEARCH_TERM_#3)) and (?>(?=SEARCH_TERM_#2)|(?=SEARCH_TERM_#3)). They ALL include the second search term in the collected data and stop before the third, whereas I want the collected data to stop before #2 or #3, whichever comes first.

Ro Yo Mi
a_hanif

2 Answers

6

Description

This regular expression will:

  • find the first SEARCH_TERM_#1
  • capture text starting after SEARCH_TERM_#1
  • stop capturing text when it encounters either SEARCH-TERM_#2 or SEARCH-TERM_#3 (whichever comes first)

^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)


Expanded

  • ^ match the beginning of the string; this forces the search to start at the beginning
  • .*? lazily match as few characters as possible up to the next expression. Note this term should be used in conjunction with the s option, which allows the dot to match newline characters
  • SEARCH_TERM_\#1 the first search term
  • ( start the capture group; this set of parentheses puts the matched text into capture group 1
  • (?: start the non-capture group. This is the real magic: it allows the contained expression to keep matching until it stumbles on either SEARCH-TERM_\#2 or SEARCH-TERM_\#3
    • (?! start the negative lookahead. Think of the regex engine moving a cursor through the input string. The lookahead simply looks at the characters after the cursor without moving the cursor. Negative means that if the contained expression matches at the cursor, the match is denied; if it does not match, the match is allowed to continue.
    • SEARCH-TERM_\#2|SEARCH-TERM_\#3 look for either value. The | is an "or" statement
    • ) close the negative lookahead
    • . match any character. The expression only gets to this spot if the preceding negative lookahead didn't find its search terms
    • ) close the non-capture group. At this point either the search has stopped because it encountered the #2 or #3 end condition, or the non-capture group matched a single character
  • * repeat the non-capture group greedily, zero or more times. Greedy is safe here because the end condition is contained inside the repeated expression.
  • ) close the capture group
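
To check the walkthrough above against the two sample lines from the question, here is a minimal sketch in Python. Python is only an assumption for illustration (the asker mentions PowerGREP), and the helper name extract is hypothetical; the pattern itself is unchanged.

```python
import re

# Pattern from the answer: consume one character at a time, but only
# while neither end marker starts at the current cursor position.
PATTERN = re.compile(
    r"^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)",
    re.DOTALL,  # the "s" option: dot also matches newlines
)


def extract(text):
    """Return the text after SEARCH_TERM_#1 up to the first end marker, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# The two sample lines from the question.
samples = [
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3",
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#3",
]
results = [extract(s) for s in samples]
for r in results:
    print(repr(r))
```

Both lines should yield the same captured text, because the capture stops at whichever end marker appears first.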

PHP code example

You didn't specify a language so I'm including this PHP example only to show how it works.

Input Text

skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3

Code

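<?php
// Sample input from the "Input Text" section above.
$sourcestring = "skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3";
// Flags: i = case-insensitive, m = ^ matches at line starts, s = dot matches newlines.
preg_match('/^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)/ims', $sourcestring, $matches);
echo "<pre>" . print_r($matches, true);
?>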

Matches

$matches Array:
(
    [0] => skip this text SEARCH_TERM_#1 find this text 
    [1] =>  find this text 
)

Real World Example

Or to use your real world example included in the comments:

Regex: ^.*?style="background-image: url\(((?:(?!&cfs=1|\)).)*)

Input text: <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">

Matches:

[0] => <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots
[1] => http://I.like.kittens.com?Name=Boots
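
The same idea can be tried on the real-world input; the following is a rough Python sketch rather than a definitive implementation. The function name background_url and the second input (without &cfs=1) are hypothetical, added only to show that the closing parenthesis acts as the border when &cfs=1 is absent. The leading ^.*? is dropped because search() already scans forward.

```python
import re

# Capture everything after 'url(' until '&cfs=1' or ')' starts,
# whichever is reached first.
URL_PATTERN = re.compile(r'style="background-image: url\(((?:(?!&cfs=1|\)).)*)')


def background_url(html):
    """Return the captured URL, or None if the style attribute is not found."""
    match = URL_PATTERN.search(html)
    return match.group(1) if match else None


# Input text from the answer.
with_cfs = (
    '<a href=http://i.like.kittens.com '
    'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">'
)
# Hypothetical variant without '&cfs=1': the ')' becomes the right border.
without_cfs = (
    '<a href=http://i.like.kittens.com '
    'style="background-image: url(http://I.like.kittens.com?Name=Boots)">'
)

print(background_url(with_cfs))
print(background_url(without_cfs))
```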

Disclaimer

This looks vaguely like a common problem in parsing HTML using regex. If your input text is HTML, then you should investigate using an HTML parsing tool rather than a regular expression.
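
As one possible illustration of the parser route, the sketch below uses Python's standard html.parser module; the choice of Python and the class name StyleUrlCollector are assumptions, not something the question specifies. The parser isolates the style attribute, and plain string handling then trims the value at &cfs=1.

```python
import re
from html.parser import HTMLParser


class StyleUrlCollector(HTMLParser):
    """Collect url(...) values found in the style attribute of any tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "style" and value:
                for url in re.findall(r"url\(([^)]*)\)", value):
                    # Cut at '&cfs=1' if present; otherwise keep the whole URL.
                    self.urls.append(url.split("&cfs=1", 1)[0])


collector = StyleUrlCollector()
collector.feed(
    '<a href=http://i.like.kittens.com '
    'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">'
)
print(collector.urls)
```

The regular expression is still there, but it only has to deal with one attribute value instead of the whole document.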

Ro Yo Mi
  • Denomales, that is amazing! - After trying so many times all these regular expressions which so overtly bordered on insanity, that simple string of yours worked like a charm. What I'm dealing with is HTML code, and while you advised me to turn to html parsers, I'd still like to know how the regex you provided works since it obviously works pretty well on the html files that I deal with ... So if you don't mind spending more time on my question, I'd like to politely inquire you a bit more on it. – a_hanif Jun 23 '13 at 15:33
  • And namely: (?:(?! | ).)*) - What does this expression mean ? Please correct these positions, if they're wrong: 1) () - a capturing group, referenced by \1 2) (?:(?! | ).) - a non-capturing group containing an alternation of negative lookaheads, which is followed by a random character 3) Does a capturing group containing a non-capturing group with a negative lookahead(s) actually matches text that is followed by the expressions inside these lookaheads? 4) And how is the condition of stop searching further upon finding one of the two lookaheads installed here? – a_hanif Jun 23 '13 at 15:33
  • Finally, you mentioned an html parser. Does it require learning another programming language? And in case one can begin using it without prior knowledge, could you perhaps point to a software, web-site, .. that would serve as an easy start for me? PS: sorry, I couldn't rate your reply, as my reputation is only 6 where 15 points for this action is required – a_hanif Jun 23 '13 at 15:33
  • Q1: \1 can refer to the capture group; Q2: yes; Q3: the lookahead is not capturing, it simply looks to see if its contained expression would match and doesn't move the cursor (it's like telling someone who is walking down the sidewalk "don't cross the road but when you reach the intersection stop"); Q4: the negative lookahead is like saying "must not match the contained expression" whereas a positive lookahead `(?=`...`)` is saying "must match the contained expression" – Ro Yo Mi Jun 23 '13 at 16:06
  • Regarding the parser. Yes there is a learning curve, but most languages have a parsing engine built in. The advantages to using a parser is that it'll read any properly formatted html and will handle deeply nested tags which are in random orders. In a couple of lines of code you could parse very specific items from HTML. HOWEVER COMMA if your html is not properly formatted or your string searching is relatively easy, then using a regular expression will probably meet your needs. – Ro Yo Mi Jun 23 '13 at 16:11
  • Denomales, thanks for the extended explanations, very much obliged. And also, I forgot to thank you for the graphic.. bet there is a special software for that kind of thing .. If I got your explanations correctly, the lookaheads just look what is to the left-right of the cursor and report on that; the * sign is what keeps the machine rolling along the text input stream, one character at a time, and the dot is simply to grab the last character which stood before the negative lookahead term and was thus omitted ... If it is so, it's both brilliant and appears near-genius to my humble mind. ))) – a_hanif Jun 23 '13 at 23:21
  • Please correct errors in the above assumptions of mine if I've made any. Your contribution was high-calibre and comprehensive. Thank you very much for it. I won't forget to add votes to your reputation when I reach "the voting age" ))) – a_hanif Jun 23 '13 at 23:23
  • Yes, that's exactly how look arounds work, and it sounds like you have a solid understanding of how this expression works to meet your needs. :) hope to see you around. – Ro Yo Mi Jun 24 '13 at 01:11
1

This pattern works well:

SEARCH_TERM_#1(.*?)SEARCH-TERM_#2_OR_#3

The content you are interested in is in the first capture group; see your language or software documentation to find out how to refer to the content of capture groups.

If supported you can use lookarounds:

(?<=SEARCH_TERM_#1).*?(?=SEARCH-TERM_#2_OR_#3)

Then the result is the whole match.

Note that I use a lazy quantifier *? instead of a greedy quantifier *. More information here.
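
Both forms can be sketched in Python as follows. This assumes the placeholder SEARCH-TERM_#2_OR_#3 stands for an alternation of the two end terms; that reading, and the variable names, are assumptions made for illustration.

```python
import re

# Assumed expansion of the placeholder SEARCH-TERM_#2_OR_#3.
END = r"(?:SEARCH-TERM_\#2|SEARCH-TERM_\#3)"

# Form 1: lazy capture group, result is in group 1.
capture_form = re.compile(r"SEARCH_TERM_\#1(.*?)" + END)

# Form 2: lookarounds, result is the whole match (group 0).
# Python requires a fixed-width lookbehind, which this literal is.
lookaround_form = re.compile(r"(?<=SEARCH_TERM_\#1).*?(?=" + END + ")")

line = "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3"

captured = capture_form.search(line).group(1)
whole = lookaround_form.search(line).group(0)
print(repr(captured))
print(repr(whole))
```

The lazy .*? expands one character at a time and stops as soon as either end term can match, so the earlier of the two terms becomes the border.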

Casimir et Hippolyte
  • Well, if you look at my question, you'll see that I tried a lookahead, an alternation inside an atomic group inside a lookahead and some weird stuff like an alternation of lookaheads inside an atomic group. ALL THESE variants capture the search term #2 AND stop before the #3, whereas I NEED the data to be collected stop BEFORE WHICH ever comes FIRST of #2 and #3. To be specific I need the collected data start after 'style="background-image: url(' and stop BEFORE either '&cfs=1' or '\);"' which ever comes first (the search terms without the quotes '...') – a_hanif Jun 22 '13 at 18:16
  • I say it again my problem is that regex machine captures '&cfs=1' and stops before '\);"' although '&cfs=1' comes before '\);"' . I'm using PowerGrep with the command 'collect data'. Thanks, all comments are most welcome. – a_hanif Jun 22 '13 at 18:17