
Suppose I have a lot of matches, for example in multi-line mode, and I want to replace each of them with part of the match plus a counter that increments with every replacement.

I was wondering if any regex flavor has such a variable. I couldn't find one, but I seem to remember something like that exists...

I'm not talking about scripting languages in which you can use callbacks for replacement. I want to be able to do this in tools like RegexBuddy, Sublime Text, gskinner.com/RegExr, and so on, in much the same way you can refer to captured substrings with \1 or $1.

  • There are languages that allow you to call a specified function, for example JavaScript: `var i=0; "foobar".replace(/o/g, function(match) { return match+"("+(i++)+")";})`. – Gumbo Nov 18 '10 at 10:36
  • So what language are you using? – Gumbo Nov 18 '10 at 10:37
  • I'm using tools like http://www.gskinner.com/RegExr/ or RegexBuddy to ease manual editing of blocks of code, so something that works in that kind of tool would be best. –  Nov 18 '10 at 10:49
  • It's language-agnostic if the OP assumes all regex flavors are the same, or that a counter is a common feature. It isn't. Also, a callback isn't really part of the flavor, it's just a fancy iterator. Either way, you should post the language you're using, maybe there's a clever 2-step solution. – Kobi Nov 18 '10 at 10:50
  • You’d probably get more solutions if you asked for a few possible target languages. On the other hand, you might also miss some interesting solutions that way, too, since you’d likely get stuck with a very small Greatest Common Factor. – tchrist Nov 18 '10 at 13:23

2 Answers


FMTEYEWTK about Fancy Regexes

Ok, I’m going to go from the simple to the sublime. Enjoy!

Simple s///e Solution

Given this:

#!/usr/bin/perl

$_ = <<"End_of_G&S";
    This particularly rapid,
        unintelligible patter
    isn't generally heard,
        and if it is it doesn't matter!
End_of_G&S

my $count = 0;

Then this:

s{
    \b ( [\w']+ ) \b
}{
    sprintf "(%s)[%d]", $1, ++$count;
}gsex;

produces this

(This)[1] (particularly)[2] (rapid)[3],
    (unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8], 
    (and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!

Interpolated Code in Anon Array Solution

Whereas this:

s/\b([\w']+)\b/#@{[++$count]}=$1/g;

produces this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!

Solution with code in LHS instead of RHS

This puts the incrementation within the match itself:

s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;

yields this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!

A Stuttering Stuttering Solution Solution Solution

This

s{ \b ( [\w'] + ) \b             }
 { join " " => ($1) x ++$count   }gsex;

generates this delightful answer:

This particularly particularly rapid rapid rapid,
    unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard, 
    and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!

Exploring Boundaries

There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.

I really wish people understood that \b isn’t what they think it is. They always think it means there's white space or the edge of the string there. They never think of it as \w\W or \W\w transitions.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

As you see, it's conditional depending on what it's touching. That’s what the (?(COND)THEN|ELSE) clause is for.
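To make the \w/\W transition point concrete, here is a small sketch. The helper name offsets and the sample string are made up for illustration and are not part of the answer's own code. It lists every offset where \b matches in a short string, then does the same with the conditional spelling of a leading \b shown above. On this particular sample the two lists should agree, and both include the offsets on either side of the apostrophe, where there is no white space at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Conditional spelling of a leading \b, copied from the text above.
my $before = qr/(?(?=\w) (?<!\w) | (?<!\W) )/x;

# Hypothetical sample: i=0 s=1 n=2 '=3 t=4 space=5 i=6 t=7 ?=8
my $text = "isn't it?";

# Return the offsets (space-separated) at which a zero-width pattern matches.
sub offsets {
    my ($re, $str) = @_;
    my @at;
    push @at, $-[0] while $str =~ /$re/g;
    return "@at";
}

my $plain = offsets(qr/\b/, $text);
my $cond  = offsets($before, $text);

print "\\b:          $plain\n";
print "conditional: $cond\n";
```

Offsets 3 and 4 sit around the apostrophe inside isn't, which is exactly why a plain \b splits such words.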

This becomes an issue with things like:

$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;

s{
    (?(?=[\-\w']) (?<![\-\w'])  | (?<![^\-\w']) )
    ( [\-\w'] + )
    (?(?<=[\-\w']) (?![\-\w'])  | (?![^\-\w'])  )
}{
    sprintf "(%s)[%d]", $1, ++$count
}gsex;

print;

which correctly prints

('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?

Worrying about Unicode

1960s-style ASCII is about 50 years out of date. Just as whenever you see anyone write [a-z], it’s nearly always wrong, it turns out that things like dashes and quotation marks shouldn’t show up as literals in patterns, either. While we’re at it, you probably don’t want to use \w, because that includes numbers and underscores as well, not just alphabetics.
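A quick sketch of both complaints, assuming a reasonably modern Perl. The sample strings and the helper all_match are hypothetical and exist only for this illustration. It checks whether an identifier-like string counts as a "word" under \w versus \p{Alphabetic}, and whether an accented name survives [a-zA-Z].

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my $id   = "r2_d2";        # letters, digits and an underscore
my $name = "Ren\x{E9}e";   # "Renee" with U+00E9 (e acute) written as an escape

# Report whether the whole string consists of characters matching $re.
sub all_match {
    my ($str, $re) = @_;
    return $str =~ /\A(?:$re)+\z/ ? "yes" : "no";
}

print '\w on r2_d2:              ', all_match($id,   qr/\w/),             "\n";
print '\p{Alphabetic} on r2_d2:  ', all_match($id,   qr/\p{Alphabetic}/), "\n";
print '[a-zA-Z] on the name:     ', all_match($name, qr/[a-zA-Z]/),       "\n";
print '\p{Alphabetic} on name:   ', all_match($name, qr/\p{Alphabetic}/), "\n";
```

The first and last checks should report yes and the middle two no: \w happily accepts the digits and the underscore, while [a-zA-Z] rejects a perfectly ordinary name.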

Imagine this string:

$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);

which you could have as a literal with use utf8:

use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);

This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:

#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;

$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);

my $count = 0;

s{ (?<WORD> (?&full_word)  )

   # the rest is just definition
   (?(DEFINE)

     (?<word_char>   [\p{Alphabetic}\p{Quotation_Mark}] )

     (?<full_word>

             # next line won't compile cause
             # fears variable-width lookbehind
             ####  (?<! (?&word_char) )   )
             # so must inline it

         (?<! [\p{Alphabetic}\p{Quotation_Mark}] )

         (?&word_char)
         (?:
             \p{Dash}
           | (?&word_char)
         ) *

         (?!  (?&word_char) )
     )

   )   # end DEFINE declaration block

}{
    sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;

print;

That code when run produces this:

(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?

Ok, so that may have been FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺

tchrist
  • Thanks for the extensive study. If I ever intend to use Perl, I'll be sure to have a thorough look at this. I'm quite confident that any scripting language can do this though, so if I end up writing a script for this I'll probably stick to Python since I know Python already. –  Nov 18 '10 at 13:35
  • @ufotds: you can’t do some of those things in Python, because Python has no support for Unicode properties nor for definition blocks and using named buffers as subroutines. If you are working with Unicode text and regexes, you really have to make severe compromises if you aren’t using either Perl or **real** PCRE. See [uniprops](http://training.perl.com/scripts/uniprops) and [unichars](http://training.perl.com/scripts/unichars) for what you’re missing out on in terms of Unicode property support. – tchrist Nov 18 '10 at 14:18

As far as I know, plain regular expressions offer no such counter.

On the other hand, there are several tools which offer it as an extension, for example grepWin. In the tool's help (press F1):

[Screenshot: grepWin help regarding replacement placeholders]

Internally it uses Boost's Perl-syntax regular expression engine, but the ${count} placeholder is implemented by grepWin itself (as are its other extensions).