Perl regex on HTML markup

Question

I just want to delete the block between

<!DOCTYPE html>

and

 <body>

including both ends, using a Perl regex.

Example text:

<!DOCTYPE html>


<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style>code{white-space: pre;}</style>



<![endif]-->;

<body>
.
.
.
anything here

This is only a sample; my real file contains a long embedded JavaScript block.

I usually test my regexes on the regex101 website, and I came up with this one:

<\!DOCTYPE html>(\n.*)*<body>

and this one, which also allows spaces or tabs inside the body tag:

s/<\!DOCTYPE html>(\n.*)*<[ \t]*body[ \t]*>//gi;

Both seem to work well on that website, but they don't work when I run them inside a Perl script.

PERL SCRIPT (using @Jan's answer):

#!/usr/bin/perl
use strict;
use warnings;

my $dirtfile = $ARGV[0];
my $cleanfile = "clean.html";

open(IN, "<", $dirtfile) or die "Can't open $dirtfile: $!";
open(OUT, ">", $cleanfile) or die "Can't open $cleanfile: $!";

while (<IN>) {
  s/(?s)<!DOCTYPE html>.+?<body>(?-s)//gi;
  print(OUT);
}

OUTPUT:

the same as input

*but it doesn't work* <= we'll need way more information about that — Thomas Ayoub, Mar 03 '16 at 12:58

score 2 · Answer 1 · mut3 · 2019-03-20T19:02:33.350

I'm pretty sure you're reading the file line by line, which renders your regex useless: no single line contains both <!DOCTYPE html> and <body>, so the pattern never matches. You'll either need to read the entire file into a string and apply the regex to that, or change your loop logic to skip every line from the one containing <!DOCTYPE html> through the one containing <body>.

In general, you should avoid working on HTML with regexes. Use a DOM extension instead.
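The first suggestion above (read the whole file into one string, then substitute) could look roughly like the sketch below. It reuses the file names from the question's script; the helper name strip_head is made up for illustration, and the pattern assumes a plain <body> tag without attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove everything from <!DOCTYPE html> through
# the first <body> tag. It receives the whole document as one string,
# so the match is free to span many lines.
sub strip_head {
    my ($html) = @_;
    # /s lets . match newlines, /i ignores case,
    # and the lazy .*? stops at the first <body>.
    $html =~ s/<!DOCTYPE\s+html>.*?<\s*body\s*>//si;
    return $html;
}

# Only touch files when a file name is given on the command line.
if (@ARGV) {
    my $dirtfile  = $ARGV[0];
    my $cleanfile = "clean.html";

    open(my $in, "<", $dirtfile) or die "Can't open $dirtfile: $!";
    my $content = do { local $/; <$in> };   # undefined $/ reads the whole file
    close($in);

    open(my $out, ">", $cleanfile) or die "Can't open $cleanfile: $!";
    print {$out} strip_head($content);
    close($out) or die "Can't close $cleanfile: $!";
}
```

Setting $/ to undef inside the do block is what switches the read from line-by-line to whole-file; the substitution itself is otherwise the same idea as in the question.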

edited Mar 20 '19 at 19:02

answered Mar 03 '16 at 13:29

mut3


score 1 · Answer 2 · answered Mar 03 '16 at 14:20

Since you are not really parsing HTML, but instead chopping a leading part of the file, you may get away with using regular expressions. This may get much more complicated if you have the target strings in any comments etc, but, if that is not the case, simply using the flip-flop operator .. should do it:

$ perl -ne 'print unless /<!DOCTYPE html>/i .. /<body>/i' file.html
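For use inside a script rather than on the command line, the same flip-flop idea might be written as in the sketch below. The sub name drop_head_lines is hypothetical, and the code assumes the opening and closing markers each sit somewhere on a line of the input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the input lines minus the block that runs
# from the <!DOCTYPE html> line through the <body> line, inclusive.
sub drop_head_lines {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        # In scalar context, .. is the flip-flop: it turns true on the line
        # matching the left pattern and stays true through the line
        # matching the right pattern.
        next if $line =~ /<!DOCTYPE html>/i .. $line =~ /<body>/i;
        push @kept, $line;
    }
    return @kept;
}

# Filter the files named on the command line, like the one-liner does.
print drop_head_lines(<>) if @ARGV;
```

Because this works on whole lines, anything that follows <body> on the same line is dropped too. The flip-flop also keeps its state between calls, so an input with no <body> line would leave it switched on for the next call.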

score 0 · Answer 3 · edited May 23 '17 at 12:23


It is usually considered bad practice to process HTML with regular expressions; nevertheless, you could use:

(?s)<!DOCTYPE html>.+?<body>(?-s)
# (?s) switches on single line mode (aka dot matches all)
# matches <!DOCTYPE html>
# then everything afterwards, lazily (.+?)
# up to and including the body tag
# (?-s) switches single line mode off again

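See a demo on regex101.com. It won't work as expected when the text <body> also appears earlier, somewhere between the doctype and the real tag (inside a comment, for instance), because the lazy match stops at the first occurrence.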
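A small sketch of both the pattern and the caveat, applied to strings that already hold the whole document. The sub name strip_lazy and the two sample strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from this answer, compiled once; /i added for case-insensitivity.
my $pattern = qr/(?s)<!DOCTYPE html>.+?<body>(?-s)/i;

# Hypothetical helper: return a copy of the text with the first match removed.
sub strip_lazy {
    my ($text) = @_;
    (my $copy = $text) =~ s/$pattern//;
    return $copy;
}

# Ordinary case: the only <body> is the real tag.
my $clean = "<!DOCTYPE html>\n<title></title>\n<body>\ncontent\n";

# Problem case: the text <body> shows up inside a comment first.
my $tricky = "<!DOCTYPE html>\n<!-- the <body> tag comes later -->\n<body>\ncontent\n";

print strip_lazy($clean);    # leaves only what followed the real <body>
print strip_lazy($tricky);   # stops at the <body> inside the comment,
                             # so the rest of the comment and the real tag remain
```

In the second case the lazy .+? ends at the first <body> it can reach, which is the one inside the comment, so part of the header survives.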

edited May 23 '17 at 12:23

Community


answered Mar 03 '16 at 12:59

Jan


It seems it's not working... I updated my question with the Perl script and output file – LaboDJ Mar 03 '16 at 13:17
