
I am working on Perl programs to parse XML and do regex string replacements on the data. I read a couple of articles on string substitution using Perl.

When replacing the source value with the target string, these articles use numbered variables ($1, $2, $3, $4, and so on). How does the pattern store values in these variables while it is matching the string?

Here is the sample code I am asking about.

Sample XML file

  <Para>
    <Hyperlink Display="hide" Protocol="http" URN="https://www.basicurl.org/oid/10.1161/RIA.0000abc">
      AHA
    </Hyperlink>
    (Free)
  </Para>
  <Para>
    <Hyperlink Display="hide" Protocol="http" URN="https://www.abcd.com">
      Background: some text with multiple lines
    </Hyperlink>
    (i have three lines of code)
  </Para>
</Comment>

Target substitution

$Str =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;

To my understanding, this selects the hyperlink data and replaces it in $Str, and the /g flag makes the substitution global. What are the values of $1, $2, and $3 for the above input file?

Aswanikumar
  • If you need to parse XML, use an XML parsing library like [XML::Parser](https://metacpan.org/pod/XML::Parser) or one of the many alternatives. [Classic answer about why you don't want to use regexes](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Shawn Sep 19 '18 at 20:45
  • The code is already implemented, and I am trying to retire the legacy system and reimplement it in a new technology, so I need some help understanding this regular expression for the above input. – Aswanikumar Sep 19 '18 at 21:07
  • See [perldoc perlretut](https://perldoc.perl.org/perlretut.html#Grouping-things-and-hierarchical-matching) for more information about capture groups and the capture variables `$1`, `$2`, ... – Håkon Hægland Sep 19 '18 at 21:32

1 Answer


A regex pattern can contain capturing groups, which are delimited by parentheses (...). They are numbered in the order in which their opening parentheses appear in the pattern, and after a successful match the text matched by each group is saved in the built-in Perl variables $1, $2, etc.

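For example, /(Hello?)(goo?d)/ captures Hello or Hell in $1, and good or god in $2. The match is case-sensitive, and the two parts must be adjacent in the string (as in Hellogood or Hellgod).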
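
A minimal sketch of that example, in case it helps to see the variables being filled in. The two input strings are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical input strings, chosen only to illustrate the two groups.
my @pairs;
for my $text ('Hellogood', 'say Hellgod now') {
    if ($text =~ /(Hello?)(goo?d)/) {
        # $1 and $2 hold the text matched by the first and second group.
        push @pairs, [$1, $2];
        print "group 1: $1, group 2: $2\n";
    }
}
```

Only read $1 and $2 after checking that the match succeeded; after a failed match they still hold whatever an earlier successful match left in them.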


About your example

$Str =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;
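  • ([^\xFF]*?) captures zero or more characters other than "\xFF" (the character with code 255). Unlike ., this character class also matches newlines, so the group can span several lines. The trailing ? makes the quantifier lazy: the group captures as few characters as possible and stops at the first position where the rest of the pattern can match.

  • ([12][890][0-9]{2}) captures exactly four digits: a 1 or 2, followed by an 8, 9, or 0, followed by any two digits from 0 to 9. This matches year-like numbers such as 1995 or 2018.

  • ([^\xFF]*?) works the same way as the first capturing group. Here it captures everything between the four-digit number and the closing </Hyperlink>.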
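
To see the three groups in action, here is a rough sketch that runs the same pattern against a hypothetical one-line input (the link text "Smith et al. 2018, Circulation" is invented for illustration and is not from your file). As far as I can tell, neither Hyperlink in your sample has a matching four-digit number in its text, so the substitution would leave that sample unchanged. The digits in the first URN sit inside the opening tag, which [^\>]*? consumes before any capturing group starts. The variable $no_year below checks that case using the first link from your sample, with whitespace removed.

```perl
use strict;
use warnings;

# Hypothetical input: one Hyperlink whose text contains a four-digit year.
my $Str = q{<Para><Hyperlink URN="https://www.abcd.com">Smith et al. 2018, Circulation</Hyperlink></Para>};

# The pattern from the question, precompiled so it can be reused.
my $re = qr|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|;

my @groups;
if ($Str =~ $re) {
    @groups = ($1, $2, $3);
    print "\$1 = '$1'\n";   # text before the year
    print "\$2 = '$2'\n";   # the four-digit number
    print "\$3 = '$3'\n";   # text after the year
}

# The same substitution as in the question, applied to a copy of $Str.
(my $out = $Str) =~ s|<Hyperlink\b[^\>]*?>([^\xFF]*?)([12][890][0-9]{2})([^\xFF]*?)</Hyperlink>|<Emph Emph.Type="Italic">$1</Emph>$2$3|g;
print "$out\n";

# First link from the question's sample: no year in the link text, so no match.
my $no_year = q{<Hyperlink Display="hide" Protocol="http" URN="https://www.basicurl.org/oid/10.1161/RIA.0000abc">AHA</Hyperlink>};
print "no match for the sample link\n" unless $no_year =~ $re;
```

Because [^\xFF] also matches < and >, a group can run past one </Hyperlink> into a later element when the first link has no year but a later one does. That is one reason the comments recommend an XML parser.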

YOGO