2

I've been googling & trying to get this myself but can't quite get it...

QUESTION: What regular expression could be used to select the text BETWEEN (but not including) the delimiter text? For example:

Start Marker=ABC
Stop Marker=XYZ

---input---
This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
------------

---expected matches-----
[1] first matched here
[2] second match
------------------------

Thanks

Alan Moore
Greg
  • One quick comment for anyone reading this -- if you're looking at this question because you want to use regular expressions for XML parsing, don't. It's something I see folks trying to do in #bash frequently, and it's a Very Bad Idea -- XML parsing is surprisingly difficult to get right, and any attempt to capture the intricacies of the syntax in a regular expression is bound to fail. Use a library or tool built for the purpose -- if, like the folks asking in #bash, you want something you can use from a shell script, see XMLStarlet. – Charles Duffy Sep 28 '09 at 04:33

3 Answers

10

Standard (basic) or extended POSIX regex syntax can't exclude the delimiters from the match itself, but it can create match groups which you can then select. For instance:

ABC(.*)XYZ

will store anything between ABC and XYZ as \1 (otherwise known as group 1).

If you're using PCREs (Perl-Compatible Regular Expressions), lookahead and lookbehind assertions are also available -- but groups are the more portable and better-performing solution. Also, if you're using PCREs, you should use *? to ensure that the match is non-greedy and will terminate at the first opportunity.

You can test this yourself in a Python interpreter (the Python regex syntax is PCRE-derived):

>>> import re
>>> input_str = '''
... This is the first line
... And ABCfirst matched hereXYZ
... and then
... again ABCsecond matchXYZ
... asdf
... '''
>>> re.findall(r'ABC(.*?)XYZ', input_str)
['first matched here', 'second match']
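
For comparison, the lookaround variant mentioned above could look like the following sketch. It assumes Python's re module, whose lookbehind must be fixed-width (fine for a literal marker such as ABC); the variable names are illustrative only.

```python
import re

text = """This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
"""

# (?<=ABC) asserts that ABC precedes the match; (?=XYZ) asserts that XYZ follows.
# Neither marker is consumed, so the whole match is just the text in between.
lookaround_matches = re.findall(r"(?<=ABC).*?(?=XYZ)", text)
print(lookaround_matches)
```

Here no group is needed, since the match itself excludes the markers, but the pattern only works in engines that support lookaround.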
Charles Duffy
  • would the \1 group contain "first matched here" and "second match", or everything between first ABC till last XYZ? – kender Sep 28 '09 at 04:19
  • @kender - To have only one match, two things would need to be true: The multiline flag would need to be set, and the asterisk would need to be greedy. Otherwise, we have two separate matches, each of which has its own groups. – Charles Duffy Sep 28 '09 at 04:23
  • I'm actually using C#, so is the idea I might be able to get at the groups (e.g. \1 group) in C#? – Greg Sep 28 '09 at 04:27
  • @Greg - Absolutely; if you have a Match m, see m.Groups. – Charles Duffy Sep 28 '09 at 04:29
  • got it thanks: foreach (Match match in matches) { GroupCollection groups = match.Groups; Console.Out.WriteLine(groups[1]); } – Greg Sep 28 '09 at 05:37
  • It isn't the Multiline flag that would need to be set, it's RegexOptions.Singleline (in Python it would be re.DOTALL or re.S). – Alan Moore Sep 28 '09 at 21:19
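
To make the point in these comments concrete, here is a small Python sketch (the variable names are mine, not from the thread). Each match object carries its own group 1; only when the quantifier is greedy and the dot may match newlines (re.DOTALL, the counterpart of RegexOptions.Singleline) does everything collapse into a single match.

```python
import re

text = "And ABCfirst matched hereXYZ\nand then\nagain ABCsecond matchXYZ\n"

# Lazy quantifier: two separate matches, each with its own group 1.
per_match = [m.group(1) for m in re.finditer(r"ABC(.*?)XYZ", text)]

# Greedy quantifier plus DOTALL: one match from the first ABC to the last XYZ.
greedy_dotall = re.findall(r"ABC(.*)XYZ", text, flags=re.DOTALL)

print(per_match)
print(greedy_dotall)
```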
3

/ABC(.*?)XYZ/

By default, regular expression quantifiers are greedy. The '?' after the '*' quantifier makes it match as little as possible, so that the first match is this:

first matched here

...instead of this, which a greedy .* would capture if the dot were also allowed to match newlines:

first matched hereXYZ
and then
again ABCsecond match 
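
The difference is easiest to see when both marker pairs sit on one line, so that newlines play no role. A minimal Python sketch (the sample string is made up for illustration):

```python
import re

line = "ABCfirstXYZ and ABCsecondXYZ"

# Greedy: .* runs to the end of the line, then backs up to the LAST XYZ.
greedy = re.findall(r"ABC(.*)XYZ", line)

# Lazy: .*? stops at the FIRST XYZ it can reach, so each pair matches separately.
lazy = re.findall(r"ABC(.*?)XYZ", line)

print(greedy)
print(lazy)
```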
0

You want the non-greedy match, .*?

while ( $string =~ /ABC(.*?)XYZ/g ) {
  my $match = $1;    # text between the markers, for this iteration
  print "$match\n";
}
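
If the markers are not fixed strings known in advance, the same idea can be wrapped in a small helper. This is only a sketch in Python; the function name between and the sample strings are hypothetical, and it assumes the markers should be treated as literal text rather than as regex syntax.

```python
import re

def between(text, start, stop):
    """Return every shortest run of text found between start and stop."""
    # re.escape keeps markers such as "[" or "*" from being read as regex syntax.
    pattern = re.escape(start) + r"(.*?)" + re.escape(stop)
    return re.findall(pattern, text)

sample = "And ABCfirst matched hereXYZ\nagain ABCsecond matchXYZ\n"
print(between(sample, "ABC", "XYZ"))
print(between("a [[b]] c [[d]]", "[[", "]]"))
```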
Charles Duffy
Devin Ceartas