Perl - split html code by "table" tag and its contents

Question

I'm trying to split a chunk of HTML code by the "table" tag and its contents.

So, I tried

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my @values = split(/<table*.*\/table>/, $html);

After this, I want @values to be ('aaa', 'bbb', 'ccc'), but I get ('aaa', 'ccc') instead. Can anyone tell me how to make split treat each table separately?

Thank you!

When parsing HTML, use an HTML parser. Perl has a good one, IIRC. – You, Aug 02 '11 at 15:38
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Leonardo Herrera, Aug 02 '11 at 15:40

score 4 · Accepted Answer · answered Aug 02 '11 at 15:33

Your regex is greedy: .* matches as much as it can, so a single match runs from the first <table to the last /table>. Change it to /<table.*?\/table>/ and it will do what you want. But you should really look into a proper HTML parser if you are going to be doing any serious work; a search of CPAN should find one suited to your needs.
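
To make the difference concrete, here is a minimal sketch that runs the question's input through both patterns. The variable names @greedy and @lazy are only illustrative.

```perl
use strict;
use warnings;

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';

# Greedy: .* runs to the LAST "/table>", so one match swallows
# both tables and the "bbb" between them.
my @greedy = split /<table.*\/table>/, $html;

# Non-greedy: .*? stops at the FIRST "/table>" after each "<table".
my @lazy = split /<table.*?\/table>/, $html;

print join(',', @greedy), "\n";   # aaa,ccc
print join(',', @lazy), "\n";     # aaa,bbb,ccc
```

The only change between the two patterns is the ? after the quantifier; everything else in the split call stays the same.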

Eric Strom


And in case it hasn't been linked enough yet, [here's why](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Robert P Aug 02 '11 at 15:47

score 3 · Answer 2 · answered Aug 02 '11 at 15:31

The .* in your regex is greedy, so it chews its way to the last /table> in the string. Change it to .*? and it will stop at the first one.

TLP


score 2 · Answer 3 · answered Aug 02 '11 at 15:33

Put a ? after the quantifier to make the wildcard match non-greedy, i.e.

my @values = split(/<table*.*?\/table>/, $html);
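
One caveat with the pattern above: table* means "tabl" followed by zero or more "e", and . does not match a newline by default, so the pattern only works by luck on single-line input. The sketch below is one possible tightening, not a general HTML solution. The input string is hypothetical, made up to show attributes, line breaks, and upper-case tags.

```perl
use strict;
use warnings;

# Hypothetical input: tables span lines, carry attributes, and use mixed case.
my $html = "aaa<table border=\"1\">\n<tr><td>test</td></tr>\n</table>"
         . "bbb<TABLE>test2</TABLE>ccc";

# \b     keeps the pattern from matching other tags that start with "table"
# [^>]*  skips any attributes inside the opening tag
# /s     lets . match newlines; /i ignores case
my @values = split /<table\b[^>]*>.*?<\/table\s*>/si, $html;

print join(',', @values), "\n";   # aaa,bbb,ccc
```

Nested tables would still break this, because .*? stops at the first closing tag it sees. That is the point where an HTML parser becomes the safer choice.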

ipd


score 2 · Answer 4 · answered Aug 02 '11 at 15:46

Using an HTML parser may be a bit of overkill for your example, but it will pay off later as your input grows more complex. A solution using HTML::TreeBuilder:

use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dump qw(dd);

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my $tree = HTML::TreeBuilder->new_from_content($html);

# remove all <table>....</table>
$_->delete for $tree->find('table');

dd($tree->guts);        # ("aaa", "bbb", "ccc")
