Variable interpolation in regular expression substitutions

Question

I am working on Perl programs to parse XML and do regex string replacements on the data. I read a couple of articles on string substitution using Perl.

When replacing the source value with the target string, they use $ variables ($1, $2, $3, $4, etc.). How does the pattern store these values while matching the string?

Below is the sample code that I am trying to understand.

Sample XML file

  <Para>
    <Hyperlink Display="hide" Protocol="http" URN="https://www.basicurl.org/oid/10.1161/RIA.0000abc">
      AHA
    </Hyperlink>
    (Free)
  </Para>
  <Para>
    <Hyperlink Display="hide" Protocol="http" URN="https://www.abcd.com">
      Background: some text with multiple lines
    </Hyperlink>
    (i have three lines of code)
  </Para>
</Comment>

Target achievement

$Str =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;

To my understanding, we are selecting the hyperlink data and replacing the value in $Str. The /g modifier means a global substitution. What are the values of $1, $2, and $3 for the above input file?

If you need to parse XML, use an XML parsing library like [XML::Parser](https://metacpan.org/pod/XML::Parser) or one of the many alternatives. [Classic answer about why you don't want to use res](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Shawn, Sep 19 '18 at 20:45
The code is already implemented. I am trying to retire the legacy system and reimplement it in a new technology, so I need some help understanding what this regular expression does for the above input. – Aswanikumar, Sep 19 '18 at 21:07
See [perldoc perlretut](https://perldoc.perl.org/perlretut.html#Grouping-things-and-hierarchical-matching) for more information about capture groups and the capture variables `$1`, `$2`, ... – Håkon Hægland, Sep 19 '18 at 21:32

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

A regex pattern can contain capturing groups, which are delimited by parentheses (...). They are numbered in the order in which their opening parentheses appear in the pattern, and the part of the string that each group matches is saved in the built-in Perl variables $1, $2, etc.

For example, /(Hello?)(goo?d)/ captures Hello or Hell in $1, and good or god in $2. The match is case-sensitive, and the two parts must be directly adjacent.
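To see the capture variables in action, here is a minimal sketch that runs that pattern against a few made-up strings. The sub name captures and the test strings are only for illustration.

```perl
use strict;
use warnings;

# Return the two captured pieces, or an empty list when the pattern fails.
# (The sub name is hypothetical.)
sub captures {
    my ($text) = @_;
    return $text =~ /(Hello?)(goo?d)/ ? ($1, $2) : ();
}

# Made-up inputs chosen for illustration.
for my $text ('Hellogood', 'Hellgod', 'hello good') {
    my @parts = captures($text);
    if (@parts) {
        # $parts[0] is what $1 held, $parts[1] is what $2 held
        print "$text: \$1='$parts[0]' \$2='$parts[1]'\n";
    }
    else {
        print "$text: no match\n";
    }
}
```

The first two strings match and give Hello/good and Hell/god. The third does not match, because of the lowercase h and the space between the two words.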

About your example

$Str =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;

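([^\xFF]*?) captures zero or more characters other than \xFF (character code 255). That character presumably never occurs in the data, so in practice this matches anything, including newlines. The ? after the * makes the quantifier lazy: it captures as few characters as possible, whereas a plain * would capture as many as possible.
([12][890][0-9]{2}) captures a digit 1 or 2, followed by a digit 8, 9 or 0, followed by any two digits. This looks like it is meant to match a four-digit year such as 1987 or 2018.
([^\xFF]*?) is the same as the first capturing group, so it captures as little as possible up to the next </Hyperlink>.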
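As a rough check of the three groups described above, the sketch below applies the substitution from the question to two inputs. The sub name rewrite_links and the first input string are invented for illustration. The second input is the first link from the sample XML.

```perl
use strict;
use warnings;

# The substitution from the question, wrapped in a sub (the sub name is
# hypothetical) so that it can be tried on different inputs.
sub rewrite_links {
    my ($Str) = @_;
    $Str =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;
    return $Str;
}

# Made-up input whose link text contains a year.
my $with_year = '<Hyperlink URN="https://www.abcd.com">Smith et al. 2018; 12: 34</Hyperlink>';
print rewrite_links($with_year), "\n";
# Here $1 is 'Smith et al. ', $2 is '2018' and $3 is '; 12: 34', so this prints:
# <Emph Emph.Type="Italic">Smith et al. </Emph>2018; 12: 34

# First link from the sample XML: no four-digit year after the opening tag.
my $no_year = <<'XML';
<Hyperlink Display="hide" Protocol="http" URN="https://www.basicurl.org/oid/10.1161/RIA.0000abc">
  AHA
</Hyperlink>
XML
my $result = rewrite_links($no_year);
print(($result eq $no_year) ? "unchanged\n" : "changed\n");
# prints: unchanged
```

As far as I can tell, neither link in the sample XML has a number of that form after its opening tag (the digits in the URN are inside the tag itself), so the pattern does not match there. $1, $2 and $3 are not set and $Str is left as it was.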

edited Jun 20 '20 at 09:12 by Community · answered Sep 19 '18 at 21:10 by YOGO

Just a suggestion: Maybe also mention that a capture group is created by enclosing a regex or part of a regex in parentheses? – Håkon Hægland Sep 19 '18 at 21:29
Yes, you are right. I just explained it with this example – YOGO Sep 19 '18 at 21:35
Yes, that was a good example. My only suggestion is to be even more explicit about it. A newcomer might not understand everything as easily as we do. – Håkon Hægland Sep 19 '18 at 21:38
Thanks, but you deleted the explanation about the lazy quantifier. There is a different result if you use *? or just * – YOGO Sep 20 '18 at 05:25
@Borodin It looks great! – Håkon Hægland Sep 20 '18 at 07:59
