I'm trying to write a tool that finds empty XML tags that span multiple lines in a large text file. E.g. don't match:

<tag>
ABC
</tag>

And match:

<tag>
</tag>

I have no problem writing the regex to match whitespace across multiple lines, but I need to find the line numbers where these matches occur (at least approximately).

I could split my text file into an array of lines, but then it would be pretty tricky to match across multiple array elements, as a match may cover more than 2 lines of tags/whitespace.

Any ideas? My implementation needs to be in Perl. Thanks!

moigno
  • See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Svante Jan 27 '11 at 13:06
  • +1 a million times to the link Svante gave. Just in case you missed it, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Robert P Jan 27 '11 at 16:30

4 Answers

if ($string =~ $regex) {
    print "Match starting line number: ", 1 + substr($string,0,$-[0]) =~ y/\n//, "\n";
}
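
The snippet above reports only the first match. Here is a minimal sketch of the same idea in a `/g` loop, using `$-[0]` and `$+[0]` (the start and end offsets of the last match) to get both the first and last line of each hit. The pattern, the tag name `tag`, and the helper names `line_of_offset` and `empty_tag_lines` are assumptions for illustration, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 1-based line number of a character offset: count the newlines before it.
sub line_of_offset {
    my ($text_ref, $offset) = @_;
    return 1 + (substr($$text_ref, 0, $offset) =~ tr/\n//);
}

# Returns [start_line, end_line] for every <tag>...</tag> pair that holds
# only whitespace and contains at least one newline.
sub empty_tag_lines {
    my ($text_ref) = @_;
    my @hits;
    while ($$text_ref =~ m{<tag>\s*\n\s*</tag>}g) {
        # @- and @+ hold the offsets of the most recent successful match
        push @hits, [ line_of_offset($text_ref, $-[0]),
                      line_of_offset($text_ref, $+[0]) ];
    }
    return @hits;
}

my $text = "<tag>\nABC\n</tag>\n\n<tag>\n</tag>\n";
printf "empty <tag> on lines %d-%d\n", @$_ for empty_tag_lines(\$text);
```

The text is passed by reference so a large file is not copied on every call. Counting newlines from the start for each match is quadratic in the worst case, which should only matter if the file has a very large number of matches.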
ysth

For this kind of work, I'd rather use an XML parser and output the line number of the closing empty tag than attempt some cumbersome regex work.
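
As a rough sketch of what that could look like, assuming the CPAN module XML::Parser is installed (the answer does not name a parser, so this choice is mine) and that the input is well-formed with a single root element. The sub name `find_empty_elements` is hypothetical. It relies on `current_line` from the underlying XML::Parser::Expat object, which reports the line of the event being handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;    # CPAN module; an assumption, not named in the answer

# Returns [name, start_line, end_line] for each element that contains only
# whitespace and whose start and end tags sit on different lines.
sub find_empty_elements {
    my ($xml) = @_;
    my (@stack, @found);

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $name) = @_;
                $stack[-1]{content} = 1 if @stack;    # a child element counts as content
                push @stack, { line => $expat->current_line, content => 0 };
            },
            Char => sub {
                my ($expat, $text) = @_;
                # only non-whitespace text counts as content
                $stack[-1]{content} = 1 if @stack && $text =~ /\S/;
            },
            End => sub {
                my ($expat, $name) = @_;
                my $elem = pop @stack;
                my $end  = $expat->current_line;
                push @found, [ $name, $elem->{line}, $end ]
                    if !$elem->{content} && $end > $elem->{line};
            },
        },
    );
    $parser->parse($xml);    # dies on malformed XML
    return @found;
}

my $xml = "<root>\n<tag>\nABC\n</tag>\n<tag>\n\n</tag>\n</root>\n";
printf "Empty <%s> on lines %d-%d\n", @$_ for find_empty_elements($xml);
```

Note that comments and processing instructions are not treated as content in this sketch, so an element holding only a comment would still be reported as empty.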

jibay

If there is only one <tag> per line, you can use the special variable $. that contains the current line number.

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;

my ($begin, $tag) = (0, '');
while (my $line = <DATA>) {
  chomp $line;
  if ($line =~ m#<(tag).*?>#) {
    $tag = $1;
    $begin = $.;
    next;
  }
  if ($line =~ m#</($tag).*?>#) {
    if ($. - $begin < 2) {
      say "Empty tag '$tag' on lines $begin - $.";
    }
    $begin = 0;
    $tag = '';
  }
}

__DATA__
<tag>
ABC
</tag>

<tag>
</tag>

output:

Empty tag 'tag' on lines 5 - 6
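
The script above only reports a pair when the closing tag is on the line right after the opening tag. The question mentions that there may be more than 2 lines of whitespace, so here is a hedged variant that still reads line by line and uses `$.`, but remembers whether any non-whitespace was seen since the opening tag. It assumes each `<tag>` and `</tag>` sits alone on its own line and that tags are not nested. The sub name `empty_tag_ranges` is made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns [open_line, close_line] for each <tag> ... </tag> pair with
# nothing but whitespace in between.
sub empty_tag_ranges {
    my ($fh) = @_;
    my ($begin, $seen_text, @ranges) = (0, 0);
    while (my $line = <$fh>) {
        if ($line =~ m{^\s*<tag>\s*$}) {
            ($begin, $seen_text) = ($., 0);    # $. is the current line number
            next;
        }
        if ($line =~ m{^\s*</tag>\s*$}) {
            push @ranges, [ $begin, $. ] if $begin && !$seen_text;
            $begin = 0;
            next;
        }
        # any non-whitespace between the tags means the element is not empty
        $seen_text = 1 if $begin && $line =~ /\S/;
    }
    return @ranges;
}

# Read from an in-memory string here; a real file handle works the same way.
my $data = "<tag>\nABC\n</tag>\n\n<tag>\n\n\n</tag>\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "Empty tag on lines $_->[0] - $_->[1]\n" for empty_tag_ranges($fh);
close $fh;
```

Because only a few scalars are kept between lines, this works on large files without loading them into memory.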
Toto

If you need a robust solution, use a real XML parser rather than naive pattern matching.

If you are prepared to use a fragile approach that may not always give the right answers, then see below :-)

#!/usr/bin/perl
use warnings;
use strict;

my $xml =<<ENDXML;
<tag>
stuff
</tag>
<tag>


</tag>
<p>
paragraph
</p>
<tag> </tag>
<tag>
morestuff
</tag>
ENDXML

while ($xml =~ m#(<tag>\s*</tag>)#g) {
    my $tag = $1;

    # count the newlines up to the end of the match (tr/// on the substr()
    # result); adding 1 gives the line number of </tag>
    my $prev_lines = substr($xml, 0, pos($xml)) =~ tr/\n// + 1;

    # adjust for newlines contained in the matched element itself
    my $tag_lines = $tag =~ tr/\n//;

    my $line = $prev_lines - $tag_lines;
    print "lines $line-$prev_lines\n$tag\n";
}
tadmc