
I'm trying to use regular expressions to remove certain blocks of markup from a text file. So far, most of my regular expressions have worked. However, I have two questions:

1) Whenever I remove a chunk of text, blank space is left where the text used to be, rather than the text simply disappearing. An example of my regex code is:

$file =~ s/<ul(.*)>//gi;

Which removes all lines with the basic format <ul...>, which is what I want it to do. However, as mentioned prior, it replaces the tag and all contained data with blank spaces, and I was wondering how to stop this particular substitution.

2) Certain regular expressions that should work don't seem to. For instance, I want to remove

<script type="text/javascript"> 

function getCookies() { return ""; }

</script>

I have tried using various regex codes, but nothing seems to remove these lines. For instance:

$file =~ s/<script type(.*)<\/script>//gi;

Which removes the <script type...> and </script> tags respectively, but leaves the

function getCookies() { return ""; }

...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!

Edit: Sorry all, I'm using Perl! Also: I just tried using

$file =~ /<script type(.*)<\/script>/sgi

...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the

function getCookies() { return ""; } 

...section stayed. Here is my entire code, including all regex:

use strict;
use warnings;

my $firstarg;
if ($ARGV[0]){
  $firstarg = $ARGV[0];
}

open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;

Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:

$file =~ s/<script type(.*)<\/script>//gi;

removes everything in between the first instance of <script ...>, but not the tag itself, nor the repetitions of the tag throughout. Using:

$file =~ s/<script type(.*)<\/script>//mgi;

results in the exact same thing. Using:

$file =~ s/<script type(.*)<\/script>//sgi;

results in the printing of several new line characters, but no other text, same for /msgi. Urgh, the problems never end... :(

NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempted practice, since it seems to fail more often than not). However, I am unfortunately required to use regex to process selected HTML files, chosen so that it is possible to remove most, if not all, of the HTML tags. I am not allowed to use a module, despite that being the most obvious and simplest answer.

jfs
Sheldon
  • because of the `=~` I'm assuming you're writing Perl here? A little more info about the environment would be useful :) – Wolph Jan 30 '11 at 02:47
  • For #2, change /gi to /msgi to handle multi-line patterns. – Paul Beckingham Jan 30 '11 at 02:48
  • Please look at the solutions I posted in the link in my answer, especially the second one. This is why one uses parsers. – tchrist Jan 30 '11 at 04:14
  • **Please** use the `HTML::TreeBuilder` module. If you aren’t already sufficiently motivated to do so after reading [this](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326), I don’t know what will convince you. – tchrist Jan 30 '11 at 04:26
  • I'm not allowed to use modules on this assignment. It's supposed to be pure regex code, minus the input/search code. Believe me, I would love to use a module, it would make my life so much easier. :( – Sheldon Jan 30 '11 at 04:41

5 Answers


I'm not sure what programming language you're using, but assuming that you're using Perl, try putting the s modifier at the end of the regex:

$file =~ /<script type(.*)<\/script>/sgi

The /s modifier makes the . match any character, including newlines (by default, . does not match a newline).
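To illustrate the effect of /s, here is a minimal sketch. The sample input is hypothetical, and the pattern uses a non-greedy .*? (an assumption, not part of the line above) so that only one script block is matched at a time.

```perl
use strict;
use warnings;

# Hypothetical sample input: a script element spread over several lines.
my $html = "before\n<script type=\"text/javascript\">\n"
         . "function getCookies() { return \"\"; }\n</script>\nafter\n";

# Without /s, "." stops at a newline, so the pattern can never
# reach the closing tag on a later line and nothing is removed.
(my $without_s = $html) =~ s/<script type.*?<\/script>//gi;

# With /s, "." also matches newlines, so the whole element is removed.
(my $with_s = $html) =~ s/<script type.*?<\/script>//gis;

print "without /s:\n$without_s\n";
print "with /s:\n$with_s\n";
```

Note that the newline that followed the closing tag is still there after the substitution, which is one source of leftover blank lines.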


Edit: I apologize, I'm not good at Perl, but I did some looking around and finally realized that the s/ in front is for substitutions. In this case, your regex should be:

$file =~ s/<script type(.*?)<\/script>//sgi;

to remove everything, including the script tags. (The .*? is non-greedy, so each script block is matched separately rather than everything from the first <script> to the last </script>.) However, if you want to keep the tags and remove only the content between them, it is:

$file =~ s/(<script type="[^"]*"\s*>).*?(<\/script>)/$1$2/sgi;

Notice the $1$2 between the slashes. This is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're replacing the whole match with an empty string. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20), like s/<ul(.*)>/ /gi.
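As a rough sketch of how the replacement part changes the result, the snippet below applies three variants to the same hypothetical one-line input: an empty replacement, a $1$2 replacement, and a single-space replacement. The variable names are made up for illustration, and the patterns use a non-greedy .*? as an assumption.

```perl
use strict;
use warnings;

# Hypothetical input with one script element between two paragraphs.
my $html = qq{<p>x</p><script type="text/javascript">var a = 1;</script><p>y</p>};

# Empty replacement: tags and content both disappear.
(my $removed = $html) =~ s/<script type="[^"]*"\s*>.*?<\/script>//sgi;

# $1$2 replacement: the captured opening and closing tags are put back,
# so only the content between them is dropped.
(my $emptied = $html) =~ s/(<script type="[^"]*"\s*>).*?(<\/script>)/$1$2/sgi;

# Single-space replacement: the whole element becomes one blank.
(my $spaced = $html) =~ s/<script type="[^"]*"\s*>.*?<\/script>/ /sgi;

print "$removed\n$emptied\n$spaced\n";
```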


Since your last edit: you'll want to use one regex for the scripts, since you don't want the contents:

$file =~ s/(<script type="[^"]*"\s*>).*?(<\/script>)/ /sgi;

and another generic regex for all the other tags:

$file =~ s/<\/?\s*[^>]+>//sgi;

I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.
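Here is a small sketch of that two-step idea on a hypothetical input: script elements first (with their contents), then any remaining tag. The third substitution, which drops lines left empty by the removals, is my own addition to address the leftover blank space and is not required by the approach. This is still regex-based stripping, so it will break on HTML that has ">" inside attribute values, comments, and similar cases.

```perl
use strict;
use warnings;

# Hypothetical input document.
my $html = "<html><title>demo</title>\n"
         . "<script type=\"text/javascript\">\nvar x = 1;\n</script>\n"
         . "<body>\n<ul>\n<li>one</li>\n</ul>\n</body></html>\n";

my $text = $html;

# Step 1: remove script elements together with their contents.
$text =~ s/<script\b[^>]*>.*?<\/script>//sgi;

# Step 2: remove every remaining tag, keeping the text between tags.
$text =~ s/<\/?\s*[^>]+>//sg;

# Step 3 (assumption): delete lines that are now empty or whitespace-only.
$text =~ s/^\s*\n//mg;

print $text;
```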

kelloti
  • Haha, I didn't notice the missing s either... Hmmm, that hasn't fixed it, though... I'm not sure what is going on... That bit of code SHOULD remove everything within those tags... Also, for the substitution in perl, there needs to be an extra / right before "sgi" to indicate what text you want to substitute in. I tried that as well, no luck. Sad panda. – Sheldon Jan 30 '11 at 03:15
  • Putting in the first regex gives me a syntax error, which I believed would be caused due to not having designated what should be substituted in places of – Sheldon Jan 30 '11 at 03:47
  • Sorry, can you paste some code here so I can see what you're running? I have a perl shell opened and I'm not getting syntax errors... – kelloti Jan 30 '11 at 03:54

To reply to your last comment:

perl -e'$file="<script etc>\nfoo\n</script>bar"; $file =~ s/<script.*script>//gis; print $file'

this does seem to do what you want, as suggested by others. I don't see how that is different from what you're trying, though.

....

Can you add this:

use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($file);

before the regexp and give us the result?
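For reference, a minimal sketch of what that dump is for: with Useqq set, Data::Dumper prints the string in double quotes with control characters shown as escapes, so stray carriage returns or tabs become visible. The sample string here is hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes instead of printing them raw.
$Data::Dumper::Useqq = 1;

# Hypothetical file content with Windows line endings and a tab.
my $file = "<script type=\"text/javascript\">\r\n\tfoo\r\n</script>";

# Dumper returns the dump as a string; print it to inspect the data.
my $dump = Dumper($file);
print $dump;
```

If the dump shows \r\n where you expected \n, or other unexpected characters inside the tags, that would explain a pattern that looks right but never matches.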

.....

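Bingo:

Lines 5 and 6 of your $file =~ list already strip the <script ...> and </script> tags on their own, so by the time your combined <script type ... </script> regex runs there is nothing left for it to match, and the function body stays behind: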

$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
## Here they come:
$file =~ s/<script(.*)>//gi;
$file =~ s/<\/script>//gi;
$file =~ s/<head>//gi;
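The effect of the ordering can be reproduced with a short sketch on a hypothetical input. The first variant runs the tag-only substitutions before the combined one, as in the list above; the second removes the whole element in one step (the [^>]* and .*? in that pattern are my assumptions, not the original code).

```perl
use strict;
use warnings;

# Hypothetical input: the script element from the question.
my $html = "<script type=\"text/javascript\">\n"
         . "function getCookies() { return \"\"; }\n</script>\n";

# Variant 1: tag-only substitutions run first.
my $tags_first = $html;
$tags_first =~ s/<script(.*)>//gi;                  # removes the opening tag
$tags_first =~ s/<\/script>//gi;                    # removes the closing tag
$tags_first =~ s/<script type(.*)<\/script>//gis;   # nothing left to match

# Variant 2: remove the whole element, content included, in one step.
my $element_first = $html;
$element_first =~ s/<script\b[^>]*>.*?<\/script>//gis;

print "tags first:    [$tags_first]\n";
print "element first: [$element_first]\n";
```

In the first variant the function body survives because the tags that delimit it are already gone when the combined pattern runs.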
Harmen
  • Putting this into my code causes my output to be... well nonexistent. I think it's caused by making $file equal "//gis; doesn't remove the tag either. I don't understand at all what is wrong... :( – Sheldon Jan 30 '11 at 03:34
  • Added a question in my answer (makes a mess when I do that in a comment) – Harmen Jan 30 '11 at 03:38
  • When I put that code in, it prints the text file (minus the regex). – Sheldon Jan 30 '11 at 03:42
  • Ah yes. I added a line in the question($Data::Dumper::Useqq=1;). Sorry 'bout that. Could you try again? – Harmen Jan 30 '11 at 03:48
  • I tried this in 2 ways. First I let the regex run after this code executed, resulting in a wall of text that seemed to be composed of the original text file, minus many of the spaces, and then a regex version of the file (minus the regex that I'm trying to fix, of course lol). On the second trial, I commented out the regex, resulting in the same aforementioned wall, and then the text file printed out as it would minus any regex. – Sheldon Jan 30 '11 at 03:54
  • Ok, would it be possible to post the first wall of text? I suspect there is something funny in there somewhere which makes your regexp do something different than you expect it to do. – Harmen Jan 30 '11 at 03:58
  • I just posted the code I'm using - maybe there is an error there. The wall of text is pretty large... I'm brand new to this site (and I must say I love it, everyone is so helpful :D :D :D), so I'm not sure if there is a character cap on the question section or not... – Sheldon Jan 30 '11 at 04:02
  • Oh. My. God. I feel so stupid, I didn't even see that in there (I thought I had commented out all of the script regex). Thanks for pointing that out! – Sheldon Jan 30 '11 at 04:14
  • no problem. Do look at my other answer too, and read a bit more about perl regexpes :) – Harmen Jan 30 '11 at 04:22
  • Guys, you just cannot attack HTML this way. Really. – tchrist Jan 30 '11 at 04:29
  • @tchrist: no of course not, but it _will_ teach Sheldon a lot about regexpes, their limits and how to debug code. And isn't that the goal of homework? – Harmen Jan 30 '11 at 04:34

If you are not allowed to use anything but Perl regular expressions, then you could adapt the following code, which strips HTML tags from a text:

#!/usr/bin/perl -w
use strict;
use warnings;

$_ = do { local $/; <DATA> };

# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags that requires a corresponding
#           end tag, plus everything up to that end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
s{
  <               # open tag
  (?:             # open group (A)
    (!--) |       #   comment (1) or
    (\?) |        #   another comment (2) or
    (?i:          #   open group (B) for /i
      (           #     one of start tags
        SCRIPT |  #     for which
        APPLET |  #     must be skipped
        OBJECT |  #     all content
        STYLE     #     to correspond
      )           #     end tag (3)
    ) |           #   close group (B), or
    ([!/A-Za-z])  #   one of these chars, remember in (4)
  )               # close group (A)
  (?(4)           # if previous case is (4)
    (?:           #   open group (C)
      (?!         #     and next is not : (D)
        [\s=]     #       \s or "="
        ["`']     #       with open quotes
      )           #     close (D)
      [^>] |      #     and not close tag or
      [\s=]       #     \s or "=" with
      `[^`]*` |   #     something in quotes ` or
      [\s=]       #     \s or "=" with
      '[^']*' |   #     something in quotes ' or
      [\s=]       #     \s or "=" with
      "[^"]*"     #     something in quotes "
    )*            #   repeat (C) 0 or more times
  |               # else (if previous case is not (4))
    .*?           #   minimum of any chars
  )               # end if previous char is (4)
  (?(1)           # if comment (1)
    (?<=--)       #   wait for "--"
  )               # end if comment (1)
  (?(2)           # if another comment (2)
    (?<=\?)       #   wait for "?"
  )               # end if another comment (2)
  (?(3)           # if one of tags-containers (3)
    </            #   wait for end
    (?i:\3)       #   of this tag
    (?:\s[^>]*)?  #   skip junk to ">"
  )               # end if (3)
  >               # tag closed
 }{}gsx;         # STRIP THIS TAG

print;

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

remove script, ul


1
2
paragraph

NOTE: This regex doesn't work for nested tag-containers e.g.:

<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested &lt;object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
  <object data="uri:another">uri:another fallback
  </object>!!!this text should be striped too!!!
</object>

Output

Nested &lt;object> example

!!!this text should be striped too!!!

Don't parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g. HTML::Parser:

#!/usr/bin/perl -w
use strict;
use warnings;

use HTML::Parser ();

HTML::Parser->new(
    ignore_elements => ["script"],
    ignore_tags => ["ul"],
    default_h => [ sub { print shift }, 'text'],
    )->parse_file(\*DATA) or die "error: $!\n";

__END__
<html><title>remove script, ul</title>
<script type="text/javascript"> 

function getCookies() { return ""; }

</script>
<body>
<ul><li>1
<li>2
<p>paragraph

Output

<html><title>remove script, ul</title>

<body>
<li>1
<li>2
<p>paragraph
Community
jfs

You’re going to have to be a lot more careful than that. See both approaches in this answer.

Community
tchrist
  • I've taken a look at the second solution, but unfortunately, I'm very new at working with perl, and am unsure as to how this code actually works. I'm assuming: s{ <! DOCTYPE .*? > }{}sx; s{ <! \[ CDATA \[ .*? \]\] > }{}gsx; s{ $style_tag_rx .*? < (?&WS) / (?&WS) style (?&WS) > }{}gsix; s{ $script_tag_rx .*? < (?&WS) / (?&WS) script (?&WS) > }{}gsix; s{ }{}gsx; is what does the actual replacement, but I honestly have no idea how this works... – Sheldon Jan 30 '11 at 04:35

This:

$file =~ s/<div(.*)>//gi;

won't do what you expect. The '*' operator is greedy. If you have a line like:

hello<div id="foo"><b>bar!</b>baz

it'll substitute as much as it can, leaving only:

hellobaz

You want:

$file =~ s/<div[^>]*>//gi;

or

$file =~ s/<div.*?>//gi;
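The difference between the three patterns can be checked with a short sketch on the sample line from this answer (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $line = 'hello<div id="foo"><b>bar!</b>baz';

# Greedy: (.*) runs to the LAST ">" on the line.
(my $greedy = $line) =~ s/<div(.*)>//gi;

# Negated character class: stops at the first ">".
(my $negated = $line) =~ s/<div[^>]*>//gi;

# Non-greedy quantifier: also stops at the first ">".
(my $lazy = $line) =~ s/<div.*?>//gi;

print "greedy:  $greedy\n";    # hellobaz
print "negated: $negated\n";   # hello<b>bar!</b>baz
print "lazy:    $lazy\n";      # hello<b>bar!</b>baz
```

The negated class has the small advantage that it also works across newlines without needing /s, since [^>] matches a newline while . does not by default.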
Harmen
  • I'll post the assignment as best as I can in the question! – Sheldon Jan 30 '11 at 04:21
  • `.` is not an operator in a regex, nor is it greedy. You mean the postfix quantifiers, which are operators of a sort, and will by default match maximally. – tchrist Jan 30 '11 at 04:22
  • @Sheldon: good luck with your homework. If it doesn't work as expected start with just 1 regexp and add the rest one by one, watching how the result changes. You'll get the hang of it one day ;) – Harmen Jan 30 '11 at 04:30