I want to use string variables for both the search pattern and the replacement in a regex substitution. The expected output looks like this:

$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e

But when I moved the pattern and the replacement into variables, $1 was not evaluated.

$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e

When I use the "ee" modifier, it gives errors.

$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
    (Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
    (Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
    (Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
    (Missing operator before _?)
aeae

What am I missing here?


Edit

Both $p and $r are written by myself. What I need is to do several similar regex replacements without touching the Perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/Python code later. Here are some examples of $p and $r.

^(.*\D)?((19|18|20)\d\d)年   $1$2<digits>年
^(.*\D)?(0\d)年  $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/])    $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d])    $1$3分之$2$4
kangshiyin
  • Note there could be security issues with using the `ee` modifier. See for example: [Using the ee modifier safely with the s/// operator when the right side is input from user](http://stackoverflow.com/q/29107353/2173773) – Håkon Hægland Dec 22 '16 at 13:18
  • @HåkonHægland Thanks, though safety issue is not a main concern here. I have full control of the list. – kangshiyin Dec 22 '16 at 13:35
  • Yes, you think you have control. Suddenly one day a typo sneaks in, and something bad happens. – Håkon Hægland Dec 22 '16 at 13:40

1 Answer

With $p="b(.)d"; you are getting a string with the literal characters b(.)d. In general, a regex pattern written in a double-quoted string is not guaranteed to survive intact, since the string processes backslash escapes first, so it may not have its expected meaning once used in a regex. However, see the Note at the end.

This is what the qr operator is for: $p = qr/b(.)d/; compiles the pattern as a regular expression.
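Since the pattern here comes from a data file, qr can also be applied to a string variable. A minimal sketch, assuming the pattern text has already been read into a variable (the name $pattern_text is only illustrative):

```perl
use strict;
use warnings;

my $pattern_text = 'b(.)d';          # e.g. one field read from the data file
my $p = qr/$pattern_text/;           # compiled once, reusable in many matches

my $var = 'abcdeabCde';
$var =~ s/$p/_$1$1_/g;
print "$var\n";                      # a_cc_ea_CC_e
```

Modifiers such as /i can be attached at the qr stage, so they travel with the pattern.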

As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
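The second evaluation can be reproduced outside of s/// with a plain string eval. This is only a sketch to illustrate the failure; the variable names are made up for the demonstration:

```perl
use strict;
use warnings;

'abcde' =~ /b(.)d/ or die "no match";    # sets $1 to 'c'

my $r = '_$1$1_';
my $result = eval $r;                    # not a valid expression, so eval fails
print "first template failed to compile\n" unless defined $result;

my $expr = q('_' . $1 . $1 . '_');       # the same replacement as valid Perl
my $ok = eval $expr;
print "$ok\n";                           # _cc_
```

The failed eval returns undef and leaves the error in $@, which is why the /ee attempt in the question ends up replacing the match with nothing.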

The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need

'$1_$2$3other'  -->  $1 . '_' . $2 . $3 . 'other'

which is valid Perl code that can be evaluated.

Breaking the string up is made easy by split, which also returns the separators when the separator pattern contains capturing parentheses.

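sub repl {
    my ($r) = @_;

    # Split on $N, keeping the separators; drop empty strings but keep a literal "0"
    my @terms = grep { length } split /(\$\d+)/, $r;

    # Leave $N as is; single-quote all other text, escaping \ and ' (s///r needs Perl 5.14+)
    return join '.', map { /^\$\d+\z/ ? $_ : q(') . s/([\\'])/\\$1/gr . q(') } @terms;
}

$var =~ s/$p/repl($r)/gee;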

With capturing parentheses /(...)/ in split's pattern, the separators are returned as part of the list. Thus this extracts from $r a list of terms which are either $N or other text, in their original order and with nothing lost. The list may include empty strings (a leading one when $r starts with $N, and one between adjacent $Ns), so those need to be filtered out.

Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.

Then /ee works in two steps: the first evaluation calls this function, which returns the string (such as above), and the second evaluates that string as Perl code.
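Put together for several rules, this might look as follows. It is a sketch under assumptions: the rule list is hard-coded here instead of being read from the data file, apply_rules is a hypothetical helper, and repl is a variant of the function above that also escapes quotes and backslashes in the literal text.

```perl
use strict;
use warnings;

# Turn a replacement template such as '$1_$2' into Perl source: $1 . '_' . $2
sub repl {
    my ($r) = @_;
    my @terms = grep { length } split /(\$\d+)/, $r;
    return join ' . ',
        map { /^\$\d+\z/ ? $_ : "'" . s/([\\'])/\\$1/gr . "'" } @terms;
}

# Hypothetical rule list; in practice these pairs would come from the data file.
my @rules = (
    [ 'b(.)d',                        '_$1$1_' ],
    [ '([TKZGD])(\d+)/(\d+)([^\d/])', '$1$2<digits>$3<digits>$4' ],
);

sub apply_rules {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($p, $r) = @$rule;
        my $code = repl($r);         # build the expression once per rule
        $text =~ s/$p/$code/gee;     # 1st e: value of $code, 2nd e: eval it
    }
    return $text;
}

print repl('$1_$2$3other'), "\n";        # $1 . '_' . $2 . $3 . 'other'
print apply_rules('abcdeabCde'), "\n";   # a_cc_ea_CC_e
print apply_rules('T12/34 x'), "\n";     # T12<digits>34<digits> x
```

Building the expression once per rule, rather than calling repl inside the substitution, avoids re-parsing the template for every match.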

We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge.
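For comparison, the $N placeholders can also be filled in by hand, so that nothing from the data file is ever evaluated as code. The following is a hand-rolled sketch, not the API of String::Substitution or Data::Munge; the helpers captures and expand are hypothetical names, and the code assumes Perl 5.14 or later for the /r modifier.

```perl
use strict;
use warnings;

# Return the capture groups ($1, $2, ...) of the last successful match.
sub captures {
    no strict 'refs';
    return map { $$_ } 1 .. $#+;     # $$_ with $_ = 1 is $1, and so on
}

# Fill $N placeholders in a template from a list of groups; nothing is eval'ed.
sub expand {
    my ($template, @groups) = @_;
    return $template =~ s{\$([1-9]\d*)}{ $groups[$1 - 1] // '' }ger;
}

my $var = 'abcdeabCde';
my ($p, $r) = ('b(.)d', '_$1$1_');
$var =~ s{$p}{ expand($r, captures()) }ge;
print "$var\n";                      # a_cc_ea_CC_e
```

Only a single /e is needed here, and a typo in the replacement text can at worst produce a wrong string rather than run unintended code.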

For more discussion of /ee see this post, with several useful answers.


Note on using "b(.)d" for a regex pattern

In this case, with parens and dot, their special meaning is maintained when the string is interpolated into the regex. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for confirming it. However, this is a special case. Double-quoted strings break many patterns, since backslash escapes are processed by the string first: for example, "\w" is just an escaped w, and as that escape is not recognized the result is a plain w. Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we then get a true compiled regex, and all modifiers may be used as well.
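The difference between the quoting styles can be checked directly. A small sketch, with the "unrecognized escape" warning silenced only for the demonstration:

```perl
use strict;
use warnings;

my $lost   = do { no warnings 'misc'; "\w+" };  # the backslash is dropped: w+
my $single = '\w+';                             # kept as written: \w+
my $regex  = qr/\w+/;                           # compiled pattern

print "double-quoted: $lost\n";                 # w+
print "single-quoted: $single\n";               # \w+

for my $case ([ double => $lost ], [ single => $single ], [ qr => $regex ]) {
    my ($name, $p) = @$case;
    printf "%-6s matches 'abc': %s\n", $name, ('abc' =~ /^$p$/ ? 'yes' : 'no');
}
```

Only the double-quoted version fails to match 'abc', because by the time it reaches the regex engine it asks for one or more literal w characters.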

zdim
  • It seems `qr` is not required as my `$p` can be matched in the second case. My problem is that both pattern and replacement string are read from file, and the number of `$N` is unknown, and depends on the number of parenthesis in the pattern string... – kangshiyin Dec 22 '16 at 10:24
  • @kangshiyin Ugh ... I thought there may be more to it. But then, things do/may depend on details of the replacement. What do you know about what `$r` may be? – zdim Dec 22 '16 at 10:58
  • @kangshiyin Can you decide how `$r` is composed in that file? The problem is that `$r = '$1'` can be double-evaluated -- first `$r` becomes literal `$1`, then `$1` is evaluated. (This is `/ee`) However, `$r = '_$1'` cannot work that way -- on the second evaluation it will choke on the bareword `_`. Well, then we can set `$r = '$1'` and do `s/$p/q(_) . eval $r/ge` ... but this requires access to `$r`, or at least knowledge of what it looks like. – zdim Dec 22 '16 at 11:10
  • Yes, both `$p` and `$r` are written by myself. What I need is to do multiple arbitrary regex replacing without touching the perl code, so `$p` and `$r` are in a separate data file. I hope this file can be used with C++/python code later. I've added some examples of `$p` and `$r` in the question. – kangshiyin Dec 22 '16 at 11:54
  • @kangshiyin Great, this helps, thank you. I have to go now but will add to the answer in a few hours. Can you please clarify -- is unicode indeed involved? Or do those characters you edited get translated? Can you also confirm, is `$r` _always_ of the form `$1$2`? – zdim Dec 22 '16 at 11:59
  • I will give more examples. I think it will be an open form. There are unicode involved - I use utf8 in all context. Thanks, please take your time. – kangshiyin Dec 22 '16 at 12:03
  • @kangshiyin Updated to your clarifications and edits, thank you. – zdim Dec 22 '16 at 21:16
  • *"When you do `$p="b(.)d"` ... the parens and dot won't have the expected meaning"* In fact they will. Try for example: `perl -E '$a="a.c"; $_="abcd";s/$a//;say'` – Håkon Hægland Dec 22 '16 at 22:47
  • @HåkonHægland Heh, surprise surprise. I wonder why parens and dot work, since in general regex patterns don't. I'll update now, thank you very much! – zdim Dec 22 '16 at 23:04
  • @HåkonHægland Fixed, thank you again. This is a puzzle for me! – zdim Dec 22 '16 at 23:17
  • Yeah it's pretty confusing :) But I think [this](http://stackoverflow.com/a/392717/2173773) answer can give some more insight. Also see the documentation for [quotemeta](http://perldoc.perl.org/functions/quotemeta.html) for what is interpolated. – Håkon Hægland Dec 22 '16 at 23:38
  • @HåkonHægland Yeah -- they helped by reminding me of the whole thing. I myself was for years intent on using single quotes over `qr`. Funny how one (I) can forget. I didn't realize that what ticked me off were _double_ quotes. Those I use only when I specifically _intend to_ interpolate, so it pushed me to go overboard. Thank you! Also, it's a good and informative link, I'll find a way to stick it in the post. I adjusted the explanation a bit further. – zdim Dec 23 '16 at 07:19