I just want to delete the block between
<!DOCTYPE html>
and
<body>
including both of those end markers, using a Perl regex.
Example text:
<!DOCTYPE html>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style>code{white-space: pre;}</style>
<![endif]-->;
<body>
.
.
.
anything here
This is only a sample; my real file contains a long embedded JavaScript block.
I usually test my regexes on the regex101 website, and I came up with this one:
<\!DOCTYPE html>(\n.*)*<body>
and this one, which tolerates extra spaces or tabs inside the <body> tag:
s/<\!DOCTYPE html>(\n.*)*<[ \t]*body[ \t]*>//gi;
It seems to work well on that website, but it doesn't work when I run it inside a Perl script.
PERL SCRIPT (using @Jan's answer):
#!/usr/bin/perl
use strict;
use warnings;
my $dirtfile = $ARGV[0];
my $cleanfile = "clean.html";
open(IN, "<", $dirtfile) or die "Can't open $dirtfile: $!";
open(OUT, ">", $cleanfile) or die "Can't open $cleanfile: $!";
while (<IN>) {
s/(?s)<!DOCTYPE html>.+?<body>(?-s)//gi;
print(OUT);
}
OUTPUT:
the same as input
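A likely explanation for the unchanged output is that `while (<IN>)` hands the substitution one line at a time, so no single string ever contains both `<!DOCTYPE html>` and `<body>`. The sketch below reads the whole file into one string before substituting. It is only one possible approach, not a confirmed fix for the real file: `strip_head` is a hypothetical helper name introduced here, and the sketch assumes the file is small enough to hold in memory and that the `<body>` tag carries no attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): removes everything
# from <!DOCTYPE html> through the first <body> tag, both ends included.
sub strip_head {
    my ($html) = @_;
    # /s lets "." match newlines, /i ignores case, ".*?" stops at the first <body>.
    $html =~ s/<!DOCTYPE html>.*?<\s*body\s*>//si;
    return $html;
}

my $dirtfile  = $ARGV[0];
my $cleanfile = "clean.html";

if (defined $dirtfile) {
    open(my $in, "<", $dirtfile) or die "Can't open $dirtfile: $!";
    # Undefining the input record separator makes <$in> return the whole file.
    my $content = do { local $/; <$in> };
    close($in);

    open(my $out, ">", $cleanfile) or die "Can't open $cleanfile: $!";
    print {$out} strip_head($content);
    close($out);
}
```

Run it the same way as the original script, passing the input file as the first argument; the result is written to clean.html.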