I'm trying to use regular expressions to remove certain blocks of coding from a text file. So far, most of my regular expression lines have worked to remove the codes. However, I have two questions:
1) Whenever I remove a chunk of text, where the text should have been is substituted with blank space, rather than simply being removed. An example of my regex code is:
$file =~ s/<ul(.*)>//gi;
Which removes all lines with the basic format <ul...>
, which is what I want it to do. However, as mentioned prior, it replaces the tag and all contained data with blank spaces, and I was wondering how to stop this particular substitution.
2) Certain regular expression codes that should work, don't seem to. For instance, I want to remove
<script type="text/javascript">
function getCookies() { return ""; }
</script>
I have tried using various regex codes, but nothing seems to remove these lines. For instance:
$file =~ s/<script type(.*)<\/script>//gi;
Which removes the <script type...>
and </script>
tags respectively, but leaves the
function getCookies() { return ""; }
...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!
Edit: Sorry all, I'm using Perl! Also: I just tried using
$file =~ /<script type(.*)<\/script>/sgi
...as well as /msgi
, but neither worked unfortunately. Both the <script type>
and </script>
tags were removed, but for some reason the
function getCookies() { return ""; }
...section stayed. Here is my entire code, including all regex:
use strict;
use warnings;
my $firstarg;
if ($ARGV[0]){
$firstarg = $ARGV[0];
}
open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;
Ok, now that I deleted the <script>
regex that was causing one problem, another has been created - using:
$file =~ s/<script type(.*)<\/script>//gi;
removes everything in between the first instance of <script ...>
, but not the tag itself, not the repetitions of the tag throughout. Using:
$file =~ s/<script type(.*)<\/script>//mgi;
results in the exact same thing. Using:
$file =~ s/<script type(.*)<\/script>//sgi;
results in the printing of several new line characters, but no other text, same for /msgi
.
Urgh, the problems never end... :(
NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since this seems to fail more often than not). However, I am unfortunately forced to use regex to parse selected HTML, ones that it will be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest of answers.