What is the meaning of this line in perl?

Question

$line =~ s/^<(\w+)=\"(.*?)\">//;

It's a [regex substitution](https://perldoc.perl.org/functions/s.html). It removes text like `<foo="bar">` at the start of a string. – LukStorms, Sep 13 '17 at 08:15
Possible duplicate of [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) — LukStorms, Sep 13 '17 at 08:19
@LukStorms, this is more for comment than for direct duplicate. — Drag and Drop, Sep 13 '17 at 09:00

Dave Cross · Accepted Answer · 2017-09-13T10:42:12.363

The s/.../.../ construct is the substitution operator. It matches its first operand, which is a regular expression, and replaces the matched text with its second operand.

By default, the substitution operator works on a string stored in $_. But your code uses the binding operator (=~) to make it work on $line instead.

The two operands to the substitution operator are the bits delimited by the / characters (there are more advanced versions of these delimiters, but we'll ignore them for now). So the first operand is ^<(\w+)=\"(.*?)\"> and the second operand is an empty string (because there is nothing between the second and third / characters).
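As a minimal sketch of the points above (the sample strings and the variable name $other are made up for illustration), the same substitution can be run against $_, against $line via the binding operator, and with braces as alternative delimiters:

```perl
use strict;
use warnings;

# Without a binding operator, s/// works on $_.
local $_ = '<foo="bar">rest';
s/^<(\w+)="(.*?)">//;
print "$_\n";        # prints: rest

# With =~, the same substitution works on $line instead.
my $line = '<foo="bar">rest';
$line =~ s/^<(\w+)="(.*?)">//;
print "$line\n";     # prints: rest

# The delimiters do not have to be slashes. Here braces are used,
# and the empty second pair {} is the empty replacement string.
my $other = '<foo="bar">rest';
$other =~ s{^<(\w+)="(.*?)">}{};
print "$other\n";    # prints: rest
```

The backslashes before the double quotes are left out here because, as noted further down, they are not needed.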

So your code says:

Examine the variable $line
Look for a section of the string which matches ^<(\w+)=\"(.*?)\">
Replace that part of the string with an empty string

All that is left now is for us to untangle the regular expression and see what it matches.

^ - matches the start of the string
< - matches a literal < character
(...) - means capture this bit of the match and store it in $1
\w+ - matches one or more "word characters" (where a word character is a letter, a digit or an underscore)
= - matches a literal = character
\" - matches a literal " character (the \ is unnecessary here)
(...) - means capture this bit of the match and store it in $2
.*? - matches zero or more instances of any character (except a newline), taking as few as possible (the ? makes the match non-greedy)
\" - matches a literal " character (once again, the \ is unnecessary here)
> - matches a literal >
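The breakdown above can be written directly into the pattern with the /x modifier, which lets a regex contain whitespace and comments. This is only a sketch: the sample input and the variables $name and $value are invented for illustration. It also shows that $1 and $2 are still available after a successful substitution, even though the matched text itself has been removed.

```perl
use strict;
use warnings;

my $line = '<colour="dark red">some text';
my ( $name, $value );

# Same pattern as in the question, one piece per line.
if ( $line =~ s/
        ^        # start of the string
        <        # a literal <
        (\w+)    # one or more word characters (capture group 1)
        =        # a literal =
        "        # a literal double quote
        (.*?)    # as few characters as possible (capture group 2)
        "        # a literal double quote
        >        # a literal >
    //x )
{
    ( $name, $value ) = ( $1, $2 );
    print "name:  $name\n";     # prints: name:  colour
    print "value: $value\n";    # prints: value: dark red
    print "rest:  $line\n";     # prints: rest:  some text
}
```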

So, all in all, this looks like a slightly broken attempt to match XML or HTML. It matches tags of the form <foo="bar"> (which isn't valid XML or HTML) and replaces them with an empty string.

score 0 · Answer 2 · answered Sep 13 '17 at 08:52

It's searching for an XML-like tag of the form <name="value"> at the start of a string, and substituting it with nothing (i.e. removing it).

For example, in the input:

<hello="world">example

The regex will match <hello="world">, and substitute it with nothing - so the final result is just:

example
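A short sketch of that example, plus two made-up inputs ($later and $two are hypothetical names) showing what the ^ anchor and the non-greedy .*? do:

```perl
use strict;
use warnings;

# The example from above.
my $line = '<hello="world">example';
$line =~ s/^<(\w+)="(.*?)">//;
print "$line\n";     # prints: example

# Because of ^, a tag that is not at the start of the string is left alone.
my $later = 'text <hello="world">example';
$later =~ s/^<(\w+)="(.*?)">//;
print "$later\n";    # prints: text <hello="world">example

# Because .*? is non-greedy, the match stops at the first "> it can,
# so only the first tag is removed.
my $two = '<a="b">x<c="d">y';
$two =~ s/^<(\w+)="(.*?)">//;
print "$two\n";      # prints: x<c="d">y
```

With a greedy .* in place of .*?, the last input would be reduced to just y, because the match would run on to the final "> in the string.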

In general, this is something that you shouldn't do with regex. There are a dozen different ways to get false negatives here: tags that don't quite fit the pattern and so don't get stripped from the string.

But if this is a "quick and dirty" script, where you don't need to worry about all possible edge cases, then it may be OK to use.
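To make the false negatives concrete, here is a small sketch that runs the substitution over a few invented inputs. The list is illustrative, not exhaustive, and the hash %stripped is just a hypothetical way to record the outcome.

```perl
use strict;
use warnings;

# Hypothetical inputs: only the first one fits the pattern exactly.
my @inputs = (
    '<hello="world">example',      # the expected form
    q{<hello='world'>example},     # single quotes
    '<hello = "world">example',    # spaces around =
    ' <hello="world">example',     # leading whitespace
    '<a href="x">example',         # a real HTML-style tag
    '<data-id="1">example',        # hyphen in the name
);

my %stripped;
for my $input (@inputs) {
    my $copy = $input;
    # s/// returns a true value only if it replaced something.
    my $changed = ( $copy =~ s/^<(\w+)="(.*?)">// );
    $stripped{$input} = $changed ? 1 : 0;
    printf "%-28s %s\n", $input, $changed ? 'stripped' : 'left as-is';
}
```

Only the first input is stripped. The others pass through unchanged, which is the kind of silent miss described above.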

Tom Lord

That's not valid XML, so I doubt many XML parsers would accept it. – Chris Turner Sep 13 '17 at 09:07
Agreed -- but I presume that this line of code is part of a larger "XML sanitisation" script. I bet there's a bunch more substitution commands, to strip "other formats" of tags. – Tom Lord Sep 13 '17 at 10:02
