Returning line numbers of a regex match across multiple lines

Question

I'm trying to write a tool that will find empty XML tags that span multiple lines in a large text file. E.g., don't match:

<tag>
ABC
</tag>

And match:

<tag>
</tag>

I have no problem in writing the regex to match whitespace across multiple lines, but I need to find the line numbers where these matches occur (approximately at least).

I would split my text file into an array, but then it'll be pretty tricky to match across multiple array elements as there may be > 2 lines of tags/whitespace.

Any ideas? My implementation needs to be in Perl. Thanks!

See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Svante, Jan 27 '11 at 13:06
+1 a million times to the link Svante gave. Just in case you missed it, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Robert P, Jan 27 '11 at 16:30

score 4 · Answer 1 · answered Jan 27 '11 at 12:45

if ($string =~ $regex) {
    print "Match starting line number: ", 1 + substr($string,0,$-[0]) =~ y/\n//, "\n";
}
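
A minimal self-contained sketch of the same idea, extended to report every match with both its start and end line. `@-` and `@+` hold the start and end offsets of the last successful match, and counting the newlines before each offset gives a line number. The sub name `match_line_ranges` and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [start_line, end_line] for every match of
# $regex in $string, based on the match offsets in @- and @+.
sub match_line_ranges {
    my ($string, $regex) = @_;
    my @ranges;
    while ($string =~ /$regex/g) {
        my ($from, $to) = ($-[0], $+[0]);
        # tr/\n// with an empty replacement list only counts newlines
        my $start = 1 + (substr($string, 0, $from) =~ tr/\n//);
        my $end   = 1 + (substr($string, 0, $to)   =~ tr/\n//);
        push @ranges, [$start, $end];
    }
    return @ranges;
}

my $string = "<tag>\nABC\n</tag>\n<tag>\n\n</tag>\n";
for my $range (match_line_ranges($string, qr{<tag>\s*</tag>})) {
    print "Match on lines $range->[0]-$range->[1]\n";   # Match on lines 4-6
}
```

Copying `$-[0]` and `$+[0]` into lexicals right after the match keeps them safe from any later pattern match inside the loop.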

ysth

score 3 · Answer 2 · answered Jan 27 '11 at 12:09

For this kind of work, I'd rather use an XML parser and output the line number of the closing tag of each empty element than attempt cumbersome regex work.

jibay

score 0 · Answer 3 · answered Jan 27 '11 at 13:08

If there is only one <tag> per line, you can use the special variable $., which contains the current line number of the last-read filehandle.

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;

my ($begin, $tag) = (0, '');
while (my $line = <DATA>) {
  chomp $line;
  if ($line =~ m#<(tag).*?>#) {
    $tag = $1;
    $begin = $.;
    next;
  }
  if ($line =~ m#</($tag).*?>#) {
    if ($. - $begin < 2) {
      say "Empty tag '$tag' on lines $begin - $.";
    }
    $begin = 0;
    $tag = '';
  }
}

__DATA__
<tag>
ABC
</tag>

<tag>
</tag>

output:

Empty tag 'tag' on lines 5 - 6
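
The `< 2` test above only catches a closing tag on the line directly after the opening one. A possible variation tracks whether any non-whitespace text appeared since the opening tag, so blank lines between the two tags still count as empty. It assumes each tag sits on its own line and that tags of the same name are not nested. The sub name `empty_tag_ranges` and the in-memory filehandle are illustrative choices, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line and return [open_line, close_line]
# for each <name>...</name> pair with only whitespace in between.
# Assumes each tag is on its own line and same-name tags are not nested.
sub empty_tag_ranges {
    my ($fh, $name) = @_;
    my ($begin, $has_content, @ranges) = (0, 0);
    while (my $line = <$fh>) {
        if ($line =~ m{<\Q$name\E\b[^>]*>}) {
            ($begin, $has_content) = ($., 0);   # $. is the current line number
        }
        elsif ($line =~ m{</\Q$name\E\s*>}) {
            push @ranges, [$begin, $.] if $begin && !$has_content;
            $begin = 0;
        }
        elsif ($begin && $line =~ /\S/) {
            $has_content = 1;                   # real text inside the element
        }
    }
    return @ranges;
}

my $text = "<tag>\nABC\n</tag>\n\n<tag>\n\n\n</tag>\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
print "Empty tag on lines $_->[0] - $_->[1]\n" for empty_tag_ranges($fh, 'tag');
close $fh;
```

For the sample above this prints `Empty tag on lines 5 - 8`. Because it reads one line at a time, it never needs the whole file in memory.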

score 0 · Answer 4 · answered Jan 27 '11 at 15:21

If you need a robust solution, use a real XML parser rather than naive pattern matching.

If you are prepared to use a fragile approach that may not always give the right answers, then see below :-)

#!/usr/bin/perl
use warnings;
use strict;

my $xml =<<ENDXML;
<tag>
stuff
</tag>
<tag>


</tag>
<p>
paragraph
</p>
<tag> </tag>
<tag>
morestuff
</tag>
ENDXML

while ($xml =~ m#(<tag>\s*</tag>)#g) {
    my $tag = $1;

    # count the newlines up to the end of the match (tr/\n// only counts,
    # it does not modify the string), then add 1 to get the line of </tag>
    my $prev_lines = substr($xml, 0, pos($xml)) =~ tr/\n// + 1;

    # adjust for newlines contained in the matched element itself
    my $tag_lines = $tag =~ tr/\n//;

    my $line = $prev_lines - $tag_lines;
    print "lines $line-$prev_lines\n$tag\n";
}
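
Since the question mentions a large file, note that the loop above rescans the string from the start for every match. One possible refinement, sketched here with a hypothetical sub `empty_tag_lines`, keeps a running line count and only counts the newlines between the end of the previous match and the current one. It still relies on naive pattern matching, so it is just as fragile as the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [first_line, last_line] for each empty
# <tag>...</tag>, counting every newline in the input at most once.
sub empty_tag_lines {
    my ($xml) = @_;
    my ($line, $last_pos, @found) = (1, 0);
    while ($xml =~ m{<tag>\s*</tag>}g) {
        my ($from, $to) = ($-[0], $+[0]);
        # newlines between the previous match end and this match start
        $line += (substr($xml, $last_pos, $from - $last_pos) =~ tr/\n//);
        # newlines inside the matched element itself
        my $end = $line + (substr($xml, $from, $to - $from) =~ tr/\n//);
        push @found, [$line, $end];
        ($line, $last_pos) = ($end, $to);
    }
    return @found;
}

my $xml = join "\n",
    '<tag>', 'stuff', '</tag>',
    '<tag>', '', '', '</tag>',
    '<p>', 'paragraph', '</p>',
    '<tag> </tag>',
    '<tag>', 'morestuff', '</tag>', '';

print "lines $_->[0]-$_->[1]\n" for empty_tag_lines($xml);
```

With the same sample data as above, this reports lines 4-7 and 11-11.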
