
I have a Perl script that executes a regex to find a markup tag (<tag>).

My regex is: <tag([^>]+)>

This works for most instances; however, I've found one where it doesn't.

If <tag> has the following form:

<tag 
attr="12345">

The regex works fine.

However, if <tag> looks like this.

<tag attr="12345"
>

No match. I've tested my regex in Notepad++ and it works fine for all instances. The problem is in my Perl script.

I've attempted several end-of-line anchors, but no luck thus far. Any help is much appreciated!

Edit: Here is my line of code.

$line =~s/<tag([^>]+)>/<!--tag $1-->/g;
Andy Lester
Jeff
  • Not sure if the regex is the problem. I've tried: `$ perl -e 'if (<> . <> =~ /<tag([^>]+)>/) { print "yes\n" }' yes` and it works. Maybe you need to show more code (maybe a small repro program). – ctn May 20 '13 at 16:13
  • Works for me: `my $s = qq(); $s =~ /<tag([^>]+)>/ and print $1;` gives `attr="12345"`. – choroba May 20 '13 at 16:14
  • Added my line of code. – Jeff May 20 '13 at 16:19
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester May 20 '13 at 16:21
  • Sorry, Lester, but I'm not parsing HTML. I'm working with a pipeline with very specific requirements. – Jeff May 20 '13 at 16:26
  • Try this: `my $line = q/ – DavidO May 20 '13 at 16:28
  • Can you show a short perl script that demonstrates the problem? I suspect you were running something slightly different than you think you were, because what you present should work fine. – ysth May 20 '13 at 16:29
  • @AndyLester he's clearly not trying to write a HTML parser, he's looking for specific tag to add a value to. Also, linking to a PHP related page when the question is tagged `perl` is pretty useless. – Glitch Desire May 20 '13 at 16:32
  • My mistake: Here's the Perl version: http://htmlparsing.com/perl – Andy Lester May 20 '13 at 16:34
  • `use Data::Dumper; print Dumper $line;` would tell if `$line` has complete or just partial tag. – mpapec May 20 '13 at 17:26
  • If it looks like XML or HTML, try using one of the established modules for that. There are HTML parsers that can be used to parse tags out of text even if they are not HTML tags but look like them. – simbabque May 20 '13 at 18:29

1 Answer


You call the string you manipulate $line. That's suspicious: for a tag that spans several lines to match, you must have concatenated multiple lines beforehand. Please check (or post) your concatenating code too. I'm 90% sure the problem is there.
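The question does not show how the input is read, so the following is only a sketch under the assumption that the script processes its input one line at a time. The sample text in $input and the helper name comment_out_tags are made up for illustration; only the substitution itself comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample input: the closing ">" of the tag sits on its own line.
my $input = qq{before\n<tag attr="12345"\n>\nafter\n};

# The substitution from the question, wrapped in a hypothetical helper.
sub comment_out_tags {
    my ($text) = @_;
    $text =~ s/<tag([^>]+)>/<!--tag $1-->/g;
    return $text;
}

# Line by line: no single line holds both "<tag" and ">", so nothing matches.
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
my $by_line = join '', map { comment_out_tags($_) } <$fh>;
close $fh;

# Whole buffer at once: [^>] also matches "\n", so the tag is found.
my $slurped = comment_out_tags($input);

print "line by line changed the text: ", ($by_line ne $input ? "yes" : "no"), "\n";
print "whole buffer changed the text: ", ($slurped ne $input ? "yes" : "no"), "\n";
```

If the real script behaves like the first variant, applying the substitution to the whole buffer (or to joined lines) rather than to each line is one possible fix.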

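There is a related pitfall worth knowing about, although your current regex avoids it. Because [^>] cannot match >, each match stops at the first closing bracket. Had you written the greedy .+ instead, as in <tag(.+)>, multiple tags on the same line would be matched as one, and the text between the first and last would be replaced as well.

<tag foo="1">foo bar <tag bar="2">baz spam

would become

<!--tag  foo="1">foo bar <tag bar="2"-->baz spam

although you probably wanted

<!--tag  foo="1"-->foo bar <!--tag  bar="2"-->baz spam

(The doubled space appears because $1 begins with the space that followed tag.) With [^>]+ you already get the second result. The lazy version of the + quantifier, +?, is not required here, but it does no harm if you want to be explicit: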

$line =~ s/<tag([^>]+?)>/<!--tag $1-->/g;
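To see how the quantifier choice affects a line with several tags, the following sketch runs the question's replacement with three patterns on the sample line above. The helper name comment_out is hypothetical, and the greedy .+ pattern is included only for comparison; it is not the pattern from the question.

```perl
use strict;
use warnings;

# Sample line from the answer: two tags on one line.
my $line = q{<tag foo="1">foo bar <tag bar="2">baz spam};

# Hypothetical helper: apply the question's replacement with a given pattern.
sub comment_out {
    my ($pattern, $text) = @_;
    $text =~ s/$pattern/<!--tag $1-->/g;
    return $text;
}

my $greedy_dot   = comment_out(qr/<tag(.+)>/,     $line);  # can run past ">"
my $negated      = comment_out(qr/<tag([^>]+)>/,  $line);  # stops at first ">"
my $negated_lazy = comment_out(qr/<tag([^>]+?)>/, $line);  # same matches, lazy

print "greedy dot:   $greedy_dot\n";
print "negated:      $negated\n";
print "negated lazy: $negated_lazy\n";
```

The two negated-class variants produce the same text, since [^>] can never consume the closing bracket; only the greedy dot swallows everything up to the last > on the line.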
SzG