I was experimenting with regex and got stuck with the following problem.

Say I have lines that start and end with batman, with an arbitrary number in between, and I want the number in a capture group as well as the word batman on each side.

batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman

This is easy to achieve with a simple pattern: (\s*batman (\d+) batman\s*) (DEMO).

Then I tried something a little harder: putting the same data between capture tags (#capture).

#capture
batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman
#capture

#others
batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman
#others

I am trying to capture only the lines between the #capture tags, and I tried

(?:#capture)(\s*batman (\d+) batman\s*)*(?:#capture)

which matches, but each capture group keeps only its last iteration, i.e. $1 => batman 1311 batman, $2 => 1311 (DEMO).
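For illustration, the "last iteration only" behaviour can be reproduced in a few lines. This is a minimal sketch that assumes Python's standard re module, which the question does not target; most backtracking engines treat a repeated group the same way.

```python
import re

# Sample data from the question (only the #capture block).
text = """#capture
batman 12345 batman
batman 234 batman
batman 35655 batman
batman 1311 batman
#capture
"""

# Same pattern as in the question: the inner groups are repeated by the outer *.
pattern = re.compile(r"(?:#capture)(\s*batman (\d+) batman\s*)*(?:#capture)")

m = pattern.search(text)
# Each group is overwritten on every iteration, so only the final one survives.
last_line = m.group(1).strip()
last_number = m.group(2)
print(last_line, "|", last_number)
```

The whole block is matched, but the four numbers are not available separately from this single match.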

I also tried to capture the repeating group using

(?:#capture)((\s*batman (\d+) batman\s*)*)(?:#capture)

This one captures everything, but spread across different groups (DEMO).

Can someone help me understand and solve this problem?

Expected result: capture only the lines inside #capture, with all the numbers in a group, so that replacement is easy.

Thanks.

karthik manchala
  • You say you want any language, but you have used something that nothing but C♯ alone supports. That is not the standard way to name/do captures. – tchrist Apr 13 '15 at 13:10
  • Sorry, I was more interested in the logic, by the way. I have updated the question to be language-specific. – karthik manchala Apr 13 '15 at 13:14
  • Oh I see. Please supply both the original input string and the desired output results separately, because it is still not clear what you want. You aren’t going to be able to capture just all numbers alone in a single match because they are discontiguous. You will need more program logic around this. You will also have to use `(?s)` mode if you have newlines involved. You could do two passes, one to get `/#capture((?s:(?!#capture).)*)#capture/` and then another to pull out all the `/\b(\d+)\b/` matches from what the first one got. If you need more `batman` constraints, then you could add those. – tchrist Apr 13 '15 at 13:21
  • @tchrist That's a nice workaround. I will surely keep it in mind when using this in applications. – karthik manchala Apr 13 '15 at 13:28
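
The two-pass idea from the comments can be sketched as follows. This is only an illustration in Python's standard re module (an assumption; the comment gives Perl-style patterns), and the function name numbers_in_capture_blocks is hypothetical.

```python
import re

# Pass 1: everything between a pair of #capture tags (DOTALL lets . cross newlines).
BLOCK_RE = re.compile(r"#capture((?:(?!#capture).)*)#capture", re.DOTALL)
# Pass 2: every standalone run of digits inside one block.
NUMBER_RE = re.compile(r"\b(\d+)\b")

def numbers_in_capture_blocks(text):
    """Return all numbers found inside #capture...#capture blocks."""
    numbers = []
    for block in BLOCK_RE.finditer(text):
        numbers.extend(NUMBER_RE.findall(block.group(1)))
    return numbers

sample = (
    "#capture\nbatman 12345 batman\nbatman 234 batman\n#capture\n\n"
    "#others\nbatman 999 batman\n#others\n"
)
print(numbers_in_capture_blocks(sample))
```

Keeping the two patterns separate means the block delimiter and the per-line constraint can be tightened independently.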

2 Answers

You can leverage the variable-width lookbehind supported by the .NET regex flavor and use this regex:

(?s)(?<=#capture.*?)(?:batman (\d+) batman)(?=.*?#capture)

However, this works only for the case you provided (e.g. it won't work if there are more #capture...#capture blocks further on in the text), so you will have to add more restrictions based on the tag context.

In PCRE/Perl, you can achieve a similar result by declaring what to skip:

(?(DEFINE)                          # Definitions
    (?<skip>\#others.*?\#others)    # What we should skip
)
(?&skip)(*SKIP)(*FAIL)              # Skip it
|
(?<needle>batman\s+(\d+)\s+batman)  # Match it

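Then, say, replace with batman new-$3 batman (the number is group 3, because the named groups skip and needle are also numbered, as 1 and 2). The pattern as written needs the x and s flags: x for the free-spacing layout and comments, s so that .*? can cross line breaks.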

See this demo on regex101.
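
For engines without (*SKIP)(*FAIL), a similar effect can be approximated with a plain alternation and a replacement callback. The sketch below assumes Python's standard re module, which is not what this answer targets, and the function name rewrite is hypothetical. It swaps the backtracking verbs for a "match it but leave it untouched" branch.

```python
import re

# Branch 1 consumes a whole #others...#others block; branch 2 captures the number.
PATTERN = re.compile(r"#others.*?#others|batman\s+(\d+)\s+batman", re.DOTALL)

def rewrite(text):
    """Prefix every number with new-, except inside #others blocks."""
    def repl(m):
        if m.group(1) is None:
            # The skip branch matched: put the block back unchanged.
            return m.group(0)
        return "batman new-{} batman".format(m.group(1))
    return PATTERN.sub(repl, text)

sample = (
    "#capture\nbatman 1311 batman\n#capture\n"
    "#others\nbatman 234 batman\n#others\n"
)
print(rewrite(sample))
```

Because the skip branch starts at the #others tag, it always wins over the batman branch for lines inside that block.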

Wiktor Stribiżew
  • Can we achieve the same with a simple regex pattern? – karthik manchala Apr 13 '15 at 13:22
  • @karthikmanchala Why do you want the numbers? Do you want to change them in the data with a substitution/alteration, or do you just want to pull them all out and do something else with them that does not affect the original data? Right now I think your problem is linewise processing. In Perl using `while (<>) { ... }` style line-by-line processing, it is just `if (/#capture/ ... /#capture/) { push @numbers, /^batman\s+(\d+)\s+batman$/; }`. – tchrist Apr 13 '15 at 13:24
  • @tchrist Once I capture the numbers in a separate group, I should be able to do both, no? If so, I am interested in both. – karthik manchala Apr 13 '15 at 13:27
  • @karthikmanchala No, this cannot be done in a single pattern because PCRE does not support discontiguous matches in a single pattern. You need auxiliary logic for storing the discontiguous matches and the linewise processing, not to mention the fencepost matches. Even if this were possible with one pattern, you almost certainly should not do so as it would be a nightmare to maintain. Use simple patterns, more than one, and you will be much happier. – tchrist Apr 13 '15 at 13:29
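
The line-by-line approach suggested in the comments might look like this outside Perl. It is a rough sketch in Python (an assumption, as the comment uses Perl's range operator); the function name collect_numbers and the toggle-on-every-tag rule are illustrative choices.

```python
import re

LINE_RE = re.compile(r"^batman\s+(\d+)\s+batman$")

def collect_numbers(lines):
    """Collect numbers from batman lines that sit between #capture tag lines."""
    numbers = []
    inside = False
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "#capture":
            inside = not inside  # each tag line opens or closes a block
            continue
        if inside:
            m = LINE_RE.match(line)
            if m:
                numbers.append(m.group(1))
    return numbers

sample = (
    "#capture\nbatman 12345 batman\nbatman 1311 batman\n#capture\n"
    "#others\nbatman 234 batman\n#others\n"
)
print(collect_numbers(sample.splitlines()))
```

The state lives in ordinary program logic, so the regex itself stays a simple single-line pattern.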
Since PCRE cannot store repeated captures, unlike the .NET framework or the new regex module for Python, one possibility is to use the \G anchor, followed by a check that the end of the block has been reached.

The \G anchor marks the position at the end of the previous match and is used in a global search context (with preg_match_all or preg_replace*). It is useful for finding contiguous results. Note that before the first match, \G by default marks the start of the string. So to prevent \G from succeeding at the start of the string, you need to add the negative lookahead (?!\A).

$pattern = '~
(?:        # two possible branches
    \G(?!\A)       # the contiguous branch
  |
    [#]capture \R  # the start branch: only used for the first match
)
(batman \h+ ([0-9]+) \h+ batman)
\R    # alias for any kind of newlines 
(?: ([#]) (?=capture) )?  # the capture group 3 is used as a flag
                          # to know if the end has been reached.
                          # Note that # is not in the lookahead to
                          # avoid the start branch to succeed
~x';

if (preg_match_all($pattern, $text, $matches) && array_pop($matches[3])) {
    print_r($matches[1]);
}
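
Where \G is not available, the same "contiguous matches, then check the end tag" idea can be emulated by anchoring each attempt at the end of the previous one. This is a hedged sketch using Python's standard re module (an assumption: it has no \G, so Pattern.match with a start position stands in for it). The name contiguous_lines is hypothetical, and unlike the PHP pattern it only looks at the first #capture tag.

```python
import re

START_RE = re.compile(r"#capture\r?\n")
LINE_RE = re.compile(r"(batman[ \t]+([0-9]+)[ \t]+batman)\r?\n")
END_RE = re.compile(r"#capture")

def contiguous_lines(text):
    """Return the batman lines of the first block, or [] if it is not closed."""
    start = START_RE.search(text)
    if start is None:
        return []
    pos = start.end()
    lines = []
    while True:
        # match() only succeeds exactly at pos, which plays the role of \G.
        m = LINE_RE.match(text, pos)
        if m is None:
            break
        lines.append(m.group(1))
        pos = m.end()
    # Keep the results only if the closing tag follows the last line directly.
    return lines if END_RE.match(text, pos) else []

sample = "#capture\nbatman 12345 batman\nbatman 234 batman\n#capture\n"
print(contiguous_lines(sample))
```

The final END_RE check plays the same role as the group 3 flag in the PHP version: a block interrupted by a foreign line yields no result.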
Casimir et Hippolyte