1

I'd like to remove all attributes of <p> in an HTML file by using this simple Perl command line:

$ perl -pe 's/<p[^>]*>/<p>/' input.html

However, it won't replace a tag such as <p class="hello"> when the tag spans multiple lines, for example:

<p 
class="hello">

So I tried removing the line endings first:

# command-1
$ perl -pe 's/\n/ /' input.html > input-tmp.html
# command-2
$ perl -pe 's/<p[^>]*>/<p>/g' input-tmp.html > input-final.html

Questions:

  1. Is there an option in (Perl) regex to try the match across multiple lines?
  2. Can I combine the two commands above (command-1 and command-2) into one? Basically, the first command needs to complete execution before the second one starts.
Sinan Ünür
moey
  • 7
    What about `
    `? Allow me to draw your attention to [this enlightening answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)...
    – johnsyweb Oct 25 '11 at 08:38
  • 1
    @ikegami: I was trying to remove all the attributes of `<p>`, not the tag / element itself.
    – moey Oct 25 '11 at 10:12
  • @Johnsyweb: Thank you for pointing out that post. I was using regex to clean up some HTML files, not to necessarily "parse" it -- I am glad! – moey Oct 25 '11 at 10:18
  • 1
    If you do not parse the HTML to clean it up, you might end up making a **clbuttic** mistake. – Sinan Ünür Oct 25 '11 at 13:28
  • @Siku-Siku.Com, ah yes. Comment retracted. – ikegami Oct 25 '11 at 19:10

4 Answers

3

-p is short for

LINE: while (<>) {
   ...
} continue {
   print
      or die "-p destination: $!\n";
}

As you can see, $_ contains only one line at a time, so the pattern can't possibly match something that spans more than one line. You can fool Perl into thinking the whole file is one line using -0777, which makes Perl read each file in a single read (slurp mode).

perl -0777 -pe's/<p[^>]*>/<p>/g' input.html

Command line options are documented in perlrun.

ikegami
1

If you write a short script, and put it in its own file, you can easily invoke it using a simple command line.

Improving the following script is left as an exercise:

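#!/usr/bin/perl

use warnings; use strict;
use HTML::TokeParser::Simple;

# Arguments: ELEMENT FILE [FILE ...]
run(\@ARGV);

sub run {
    my ($argv, $opt) = @_;
    $opt ||= {};    # no parser options are passed by default

    # The first argument is the element name; the rest are input files.
    my $el = shift @$argv;

    for my $src (@$argv) {
        clean_attribs($src, $el, $opt);
    }
}

sub clean_attribs {
    my ($src, $el, $opt) = @_;
    my $el_pat = qr/^$el\z/;

    my $parser = HTML::TokeParser::Simple->new($src, %$opt);

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag($el_pat)) {
            # Re-emit a matching start tag without its attributes.
            my $tag = $token->get_tag;
            print "<$tag>";
        }
        else {
            # Everything else is printed exactly as it appeared.
            print $token->as_is;
        }
    }
}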
Sinan Ünür
0

perl -pe 'BEGIN { undef $/ } s/<p[^>]*>/<p>/g'

Max
-3
$ perl -pe 's/\n/ /; s/<p[^>]*>/<p>/gs;' input.html > input-final.html
atlau
  • 1
    There's no `.` for `s` to affect. It doesn't help. Furthermore, joining the two commands in this fashion caused the commands to stop working. – ikegami Oct 25 '11 at 09:31
  • Unfortunately, when you combined those two commands as suggested, the `<p>` substitution doesn't work across multiple lines even with the 's' added. I think it's because the first command and the second are in the same loop. As mentioned in the question, I think the first command needs to finish execution before the second should start.
    – moey Oct 25 '11 at 10:11