Deleting one HTML table but not another using Perl regex substitution

Question

I'm still learning Perl, so any help you could offer would be greatly appreciated. I'm sure there is a simple answer to the problem I am looking at, but I'm not sure I can figure it out. Thanks in advance for your help!

I have a txt file with a bunch of HTML code in it. There are a number of HTML tables that I want to remove. However, there are a couple that I want to keep. These tables, the keepers, have specific words in them.

Let's say $txt represents the text document:

$txt = "<TABLE> The brown dog runs </TABLE> 
        Here is another animal 
        <TABLE> The black cat walks </TABLE> 
        Here is another animal
        <TABLE> The Orange snake slithers </TABLE> 
        Here is another animal   
        <TABLE> Green lizard crawls </TABLE> 
        Here is another animal 
        <TABLE> The brown bird flys </TABLE> 
        Here is another animal          
        <TABLE> The green duck flys </TABLE> 
        Here is another animal";

I want to keep any table that has a brown animal or a flying animal. I don't want to keep any of the other tables. (I want to keep the 1st, 5th and 6th tables and get rid of the rest.) So keep the table if it has the word brown or the word flys, and delete it if it doesn't.

I have used the following regular expression to cut out tables in other cases, but this will remove all tables.

$txt =~ s{(<Table>.*?)(</Table>)}{table_was_here}ismog;

How could I modify this code to keep tables that contain certain text strings?

Thanks again!

Why not parse HTML with a HTML parser? Take a look at [Mojo::DOM](https://metacpan.org/module/Mojo::DOM), [HTML::Parser](https://metacpan.org/module/HTML::Parser) and [HTML::TreeBuilder](https://metacpan.org/module/HTML::TreeBuilder) — matthias krull, Jul 20 '12 at 07:03
Thanks for your suggestion. I do use an HTML parser, but I need to keep the contents of some of the tables and not the contents of other tables. I will perform the Table deletion first and then parse the HTML. I am using the HTML::TreeBuilder. — user1500158, Jul 20 '12 at 07:09
see: http://stackoverflow.com/a/1732454/632407 and http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — clt60, Jul 20 '12 at 07:11
Cute. So, just replace <TABLE> and </TABLE>
with TABLE and /TABLE. Pretend it isn't HTML. How would you answer my question? I'm fairly confident that the HTML code I'm working on isn't so complex that I'll end up with HTML tags spewing from my eyes while the Grim-regex-reaper creeps into my room and chops my fingers off to keep me from any more HTML parsing via regex. — user1500158, Jul 20 '12 at 07:21

score 0 · Answer 1 · answered Jul 20 '12 at 06:18

Change it to:

$txt =~ s{(<Table>.*?(brown|flys).*?(</Table>)}{table_was_here}ismog;

(small note, the correct spelling is "flies", not "flys")

Bart Friederichs


This doesn't do it. You are missing a ). – user1500158 Jul 20 '12 at 06:37
Ye, I just found out you have to negate the (brown|flys) part. Cannot find how to do that now though. – Bart Friederichs Jul 20 '12 at 06:38
I'm trying to put a negative lookahead into it: $txt =~ s{(<Table>.*?(?!brown|flys)).*?(</Table>)}{table_was_here}ismog; But this isn't working. – user1500158 Jul 20 '12 at 06:39
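
One way the negative lookahead from the comment above could be made to work is to test it at every character inside the table instead of only once. What follows is a minimal sketch under that assumption, not something posted in this thread. The sub name strip_unwanted_tables is hypothetical, the sample text is a shortened version of the one in the question, and removed tables are replaced with nothing rather than with table_was_here.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every <TABLE>...</TABLE> block that does
# NOT contain "brown" or "flys"; all other text is returned unchanged.
sub strip_unwanted_tables {
    my ($txt) = @_;
    # (?:(?!</TABLE>|brown|flys).)* consumes one character at a time, and
    # only while that position does not start "</TABLE>", "brown" or "flys".
    # A table containing a keyword can never reach its closing tag this
    # way, so the match fails there and that table is left alone.
    $txt =~ s{<TABLE>(?:(?!</TABLE>|brown|flys).)*</TABLE>}{}isg;
    return $txt;
}

my $txt = "<TABLE> The brown dog runs </TABLE>\n"
        . "Here is another animal\n"
        . "<TABLE> The black cat walks </TABLE>\n"
        . "Here is another animal\n"
        . "<TABLE> The green duck flys </TABLE>\n";

print strip_unwanted_tables($txt);
```

With this sample only the black cat table should disappear. The single lookahead in the comment fails because .*? can simply stop one character earlier or later to satisfy it; repeating the check before every character closes that loophole.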

score 0 · Accepted Answer · answered Jul 20 '12 at 17:38

Both of the following will work:

$txt =~ s{<TABLE>.*?</TABLE>}{$_ = $&; /brown|flys/ ? $_ : ''}isge;

for ( $txt =~ m{<TABLE>.*?</TABLE>}isg ) {
    # \Q...\E makes the table text match literally, even if it contains regex metacharacters
    $txt =~ s/\Q$_\E// if !/brown|flys/;
}

Output of both:

<TABLE> The brown dog runs </TABLE> 
Here is another animal 

Here is another animal

Here is another animal   

Here is another animal 
<TABLE> The brown bird flys </TABLE> 
Here is another animal          
<TABLE> The green duck flys </TABLE> 
Here is another animal
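
The first substitution above can also be written with a capture group instead of $&, which leaves $_ untouched. This is only a sketch of the same idea; the sub name keep_matching_tables and its keyword-pattern parameter are assumptions made for illustration, and the sample text is shortened from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the tables whose text matches $keep_re
# and delete the rest.
sub keep_matching_tables {
    my ($txt, $keep_re) = @_;
    # /e evaluates the replacement as Perl code for each table found.
    # $1 is copied into $table first because the inner match against
    # $keep_re would otherwise reset the capture variables.
    $txt =~ s{(<TABLE>.*?</TABLE>)}{
        my $table = $1;
        $table =~ $keep_re ? $table : '';
    }isge;
    return $txt;
}

my $txt = "<TABLE> The brown dog runs </TABLE>\n"
        . "Here is another animal\n"
        . "<TABLE> The black cat walks </TABLE>\n"
        . "Here is another animal\n"
        . "<TABLE> The brown bird flys </TABLE>\n";

print keep_matching_tables($txt, qr/brown|flys/);
```

Passing the keywords as a qr// pattern means the list of words to keep can change without touching the substitution itself.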

Hope this helps!

These are great! Thanks so much! – user1500158 Jul 21 '12 at 08:46
