I have more than 100 HTML files with the following structure:

<html>
<head>
<body>
    <TABLE>
      ...
    </TABLE>
    <TABLE>
        <TR>
            <td rowspan=2><img src="http://www.example.com" width=10></td>
            <TD width=609 valign=top>
                <!-- Content of file1 -->
                <p>abc</p>
                ...
                ...
                ...
                <p>xyz</p>
            </TD>
        </TR>
        <TR>
            <TD align="center" ...alt="top"></a></TD>
        </TR>
    </TABLE>        
</body>
</html>

I'd like to merge, into a single HTML file, the content of column 2 of the first row of the second table (TABLE[2]ROW[1]COLUMN[2]) of each file, to get output like this:

<html>
<head>
<body>
    <!-- Content of file1 -->
    <p>abc</p>
    ...
    ...
    ...
    <p>xyz</p>

            <!-- Content of file2 -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>

    ..
    ..
    ..
            <!-- Content of fileN -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>
</body>
</html>

I'm new to Perl and would appreciate some pointers on how to do this. Thanks in advance.

Below is my first attempt for file1, but I'm not sure I'm on the right track.

use HTML::TableExtract;

open (my $html,"<","file1.html");

my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 2, br_translate => 0 );
$table->parse($html);

foreach my $row ($table->rows) {
    print join("\t", @$row), "\n";
}
Ger Cas

2 Answers

2

The HTML::TableExtract documentation states that depth, count, row and col all start from 0, so TABLE[2]ROW[1]COLUMN[2] corresponds to depth 0, count 1, row 0, col 1.

The following code is a skeleton; it assumes that all the HTML files are stored in one directory.

With the help of glob we obtain the names of the HTML files.

Then we write a subroutine extract_table_cell, which takes a file name plus the parameters depth, count, row and col and extracts the data located at that position.

For each file name we call extract_table_cell and store the returned data in the array @data.

We also write a subroutine gen_html, which takes a reference to the @data array and returns HTML code representing the data.

Finally we call say with the result of gen_html as its argument to output the result.

NOTE: you will need to adapt extract_table_cell to get the cell data in the format you want.

use strict;
use warnings;
use feature 'say';

use HTML::TableExtract;

my($depth,$table,$row,$col) = (0,1,0,1);
my @data;

for (glob("*.html")) {
    push @data, extract_table_cell($_,$depth,$table,$row,$col);
}

say gen_html(\@data);

sub gen_html {
    my $data = shift;

    my($html,$block);

    for ( @{$data} ) {
        $block .= "\t\t$_\n";
    }

    $html =
"
<html>
    <head>
    </head>
    <body>
    $block
    </body>
</html>
";

    return $html;
}

sub extract_table_cell {
    my($file,$depth,$count,$row,$col) = @_;

    my $te = HTML::TableExtract->new( depth => $depth, count => $count );

    $te->parse_file($file);

    my $table = $te->first_table_found;

    return ${ $table->{grid}[$row][$col] };
}

Output

<html>
    <head>
    </head>
    <body>
        B 1.2
        D 1.2

    </body>
</html>

Test data files:

table_1.html

<html>
    <head>
    </head>
    <body>
        <table>
            <tr><td>A 1.1</td><td>A 1.2</td><td>A 1.3</td></tr>
            <tr><td>A 2.1</td><td>A 2.2</td><td>A 2.3</td></tr>
            <tr><td>A 3.1</td><td>A 3.2</td><td>A 3.3</td></tr>
            <tr><td>A 4.1</td><td>A 4.2</td><td>A 4.3</td></tr>
        </table>

        <table>
            <tr><td>B 1.1</td><td>B 1.2</td><td>B 1.3</td></tr>
            <tr><td>B 2.1</td><td>B 2.2</td><td>B 2.3</td></tr>
            <tr><td>B 3.1</td><td>B 3.2</td><td>B 3.3</td></tr>
            <tr><td>B 4.1</td><td>B 4.2</td><td>B 4.3</td></tr>
        </table>
    </body>
</html>

table_2.html

<html>
    <head>
    </head>
    <body>
        <table>
            <tr><td>C 1.1</td><td>C 1.2</td><td>C 1.3</td></tr>
            <tr><td>C 2.1</td><td>C 2.2</td><td>C 2.3</td></tr>
            <tr><td>C 3.1</td><td>C 3.2</td><td>C 3.3</td></tr>
            <tr><td>C 4.1</td><td>C 4.2</td><td>C 4.3</td></tr>
        </table>

        <table>
            <tr><td>D 1.1</td><td>D 1.2</td><td>D 1.3</td></tr>
            <tr><td>D 2.1</td><td>D 2.2</td><td>D 2.3</td></tr>
            <tr><td>D 3.1</td><td>D 3.2</td><td>D 3.3</td></tr>
            <tr><td>D 4.1</td><td>D 4.2</td><td>D 4.3</td></tr>
        </table>
    </body>
</html>
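
As a variation on extract_table_cell above, here is a minimal sketch that returns the cell's HTML markup instead of its plain text and reads and writes UTF-8. The name extract_table_cell_html is hypothetical, and the sketch assumes the input files are UTF-8 encoded. It relies only on the module's documented keep_html option and on the rows method of the table object; note that the documentation says embedded tables are not retained in the HTML extracted from a cell.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical variant of extract_table_cell: returns the cell's raw HTML.
# Assumption: the input files are UTF-8 encoded.
sub extract_table_cell_html {
    my ($file, $depth, $count, $row, $col) = @_;

    my $te = HTML::TableExtract->new(
        depth     => $depth,
        count     => $count,
        keep_html => 1,    # keep the markup found inside the cell
    );

    # Decode while reading so the parser sees characters, not bytes.
    open(my $fh, '<:encoding(UTF-8)', $file) or die "Cannot open $file: $!";
    $te->parse_file($fh);
    close $fh;

    my $table = $te->first_table_found
        or return;    # no table at this depth/count in this file
    my @rows = $table->rows;
    return $rows[$row][$col];
}

# Encode on output so non-ASCII characters do not trigger
# "Wide character" warnings.
binmode(STDOUT, ':encoding(UTF-8)');

for my $file (sort glob('*.html')) {
    my $cell = extract_table_cell_html($file, 0, 1, 0, 1);
    print "<!-- Content of $file -->\n$cell\n" if defined $cell;
}
```

The loop prints one comment line per file followed by the extracted cell, which is close to the layout requested in the question; wrapping the result in html and body tags can be done as in gen_html.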
Polar Bear
  • Yes `$ which env /usr/bin/env` This time it works. I tried with 2 real files and I get this error `$ ./script.pl > out.html Wide character in say at ./script.pl line 16.` – Ger Cas Mar 25 '20 at 19:52
  • @GerCas -- this error relates to a completely different area: character encoding. I guess you are processing German text, which is outside the ASCII range. You need to declare the encoding of your input and output, probably UTF-8 (possibly UTF-16, but that is rare; UTF-8 is far more common on the web). Please see the following [documentation](https://perldoc.perl.org/open.html) on opening encoded files. In your case the `locale` approach might be sufficient. – Polar Bear Mar 25 '20 at 19:57
  • @GerCas -- some information on [utf8](https://perldoc.perl.org/utf8.html) in perl code. [How can I output UTF-8 from Perl?](https://stackoverflow.com/questions/627661/how-can-i-output-utf-8-from-perl). – Polar Bear Mar 25 '20 at 20:01
  • @GerCas -- if you look at the documentation of [HTML::TableExtract](https://metacpan.org/pod/HTML::TableExtract), there is an option `keep_html` that preserves **html tags** in the extracted data. You just need to enable this option when creating the `$te` object. – Polar Bear Mar 25 '20 at 20:14
  • @GerCas -- try `keep_html => 1` instead. – Polar Bear Mar 25 '20 at 20:29
  • It works with `keep_html => 1`. Thanks so much for your help. The only remaining issue is the one mentioned in the documentation you linked: `Embedded tables are not retained in the HTML extracted from a cell`. I tested, and a nested table element containing `some data` comes out empty in the output. Should I open a new question about this, or is there an easy solution? – Ger Cas Mar 25 '20 at 20:38
  • @GerCas -- well, I also noticed that tables are not retained in the cell. That is understandable, as this module extracts tables and works with the _depth,count,row,col_ variables. You may have to go one level deeper in _depth_ to extract a table stored in the cell. Since the cell seems to contain other data besides the table, you will probably end up with the code you already have plus extra code that extracts the nested table by going into _depth_. – Polar Bear Mar 25 '20 at 20:42
  • @GerCas -- if [HTML::TableExtract](https://metacpan.org/pod/HTML::TableExtract) is not the best fit for your _target_, then look at the [HTML::DOM](https://metacpan.org/pod/HTML::DOM) module, which parses an HTML document into a _Document Object Model_ tree. Once you have the DOM object you can navigate down any branch to reach the desired element. But be warned that a lot of web pages are _broken_ (closing tags missing or misplaced, or HTML that does not comply with the DOM model), and you will have to find a way around such cases. – Polar Bear Mar 25 '20 at 20:47
  • Excellent. Thanks so much once more for the help. It works apart from that table issue; I think it would be easier for me to insert the missing table blocks into the output by hand than to modify your very useful code to handle that part. Regards – Ger Cas Mar 25 '20 at 20:53
  • @GerCas -- this code can easily be _improved_ by passing _depth,table,row,col_ as arguments on the command line. That way any particular cell can be extracted from many HTML files. I am not sure about the purpose of such code, but life is full of surprises -- sometimes something insignificant grows into an unexpected project and people find new uses for it. – Polar Bear Mar 25 '20 at 21:03
  • @GerCas - A very good example is Perl itself -- Larry Wall was working as a system admin and had to program with many tools and shells; in the end he created Perl and made it publicly available. He started to receive feature requests, then people started to submit their own solutions, which eventually turned into Perl modules. Today CPAN is a huge library which makes many things possible. – Polar Bear Mar 25 '20 at 21:03
  • Thanks for your comments, I'll play around with what you said about changing depth,table,row,col. The purpose of the code is that I have several small HTML files that are not structured in a good or easy way, and I want to have all of them in a clearer, simpler structure, without tables, in a single HTML file. – Ger Cas Mar 25 '20 at 21:10
  • @GerCas -- if you have, or will have, a significant amount of data, then it might make sense to store it in a database: [SQLite](https://www.sqlite.org/index.html), [MariaDB](https://mariadb.org/), [MySQL](https://www.mysql.com/), [PostgreSQL](https://www.postgresql.org/). It is much faster than reading from files, and with [SQL](https://en.wikipedia.org/wiki/SQL) you can _slice and dice_ like a _circus juggler_; then it is just a matter of how to present the data: HTML, text file, PDF file, LaTeX file or whatever. Perl works very well with databases with the help of [DBI](https://metacpan.org/pod/DBI). – Polar Bear Mar 25 '20 at 21:17
  • Good idea to store it in a DB, but for that I would need another script to extract the HTML data and store it properly in the DB, hehe, and that is the difficult part. – Ger Cas Mar 25 '20 at 21:41
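
The command-line idea mentioned in the comments above could be sketched roughly as follows, using the core Getopt::Long module. The option names --depth, --table, --row and --col and the helper parse_position are hypothetical; the defaults are the values hard-coded in the answer.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse --depth, --table, --row and --col (all 0-based, as in the module),
# falling back to the values hard-coded in the answer.
sub parse_position {
    my @args = @_;
    my %pos = (depth => 0, table => 1, row => 0, col => 1);
    GetOptionsFromArray(
        \@args,
        'depth=i' => \$pos{depth},
        'table=i' => \$pos{table},
        'row=i'   => \$pos{row},
        'col=i'   => \$pos{col},
    ) or die "Usage: $0 [--depth N] [--table N] [--row N] [--col N] [files]\n";
    return (\%pos, \@args);    # whatever is left in @args are the file names
}

my ($pos, $files) = parse_position(@ARGV);
print "depth=$pos->{depth} table=$pos->{table} row=$pos->{row} col=$pos->{col}\n";
print "file: $_\n" for @$files;
```

The returned hash could then be passed to extract_table_cell in place of the fixed ($depth,$table,$row,$col) list, with the remaining arguments replacing the glob.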
1

Polar Bear's answer is probably the best one. I just want to add a different idea for getting TABLE[2]ROW[1]COLUMN[2] without using HTML::TableExtract. You said you are new to Perl, so I think this idea will interest you. The idea is to use regular expressions, for example:

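use strict;
use warnings;

$/ = "</html>";    # read one whole HTML document per record

while (my $doc = <>) {    # files named on the command line, or STDIN
    # Second table: the first <table ...> that follows a closing </table>.
    my ($table2) = $doc =~ m{</table>\s*<table[^>]*>(.*?)</table>}is or next;
    # First row of that table.
    my ($row1) = $table2 =~ m{<tr[^>]*>(.*?)</tr>}is or next;
    # Second cell of that row (tags may carry attributes, in any case).
    my ($col2) = $row1 =~ m{</td>\s*<td[^>]*>(.*?)</td>}is or next;
    print $col2;
}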

This code gets TABLE[2]ROW[1]COLUMN[2] from each document, provided the markup is simple (in particular, no tables nested inside cells).

Sample input:

<html>
<table>

</table>
<table>
    <tr>
        <td>
          hello world
        </td>
        <td>
          corona 
        </td>
    </tr>
    <tr>
    </tr>
</table>
</html>

Output:

  corona 
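
To apply the same regex idea to many files and merge the results into one page, a minimal sketch could look like the following. The helper name second_cell is hypothetical, the file names are assumed to be given on the command line, and, like any regex-based approach, it assumes simple markup without nested tables.

```perl
use strict;
use warnings;

# Return the inner HTML of TABLE[2]ROW[1]COLUMN[2] from one document,
# or undef when the expected structure is not found.
sub second_cell {
    my ($doc) = @_;
    my ($table2) = $doc    =~ m{</table>\s*<table[^>]*>(.*?)</table>}is or return;
    my ($row1)   = $table2 =~ m{<tr[^>]*>(.*?)</tr>}is                 or return;
    my ($col2)   = $row1   =~ m{</td>\s*<td[^>]*>(.*?)</td>}is         or return;
    return $col2;
}

my @cells;
for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $doc = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my $cell = second_cell($doc);
    push @cells, "<!-- Content of $file -->\n$cell" if defined $cell;
}

print "<html>\n<head>\n</head>\n<body>\n", join("\n", @cells), "\n</body>\n</html>\n";
```

It could be run as `perl merge.pl *.html > out.html`, where merge.pl is whatever name the script is saved under.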
glennmark
  • Hi glennmark, thanks for your answer. I was trying to test your code. I pasted your code into `script.pl` and the sample input into `sample.html`, but when I run it like this I don't get any output: `perl script.pl sample.html`. What would be the correct way to test it? – Ger Cas Mar 25 '20 at 17:34
  • The way you run it is correct. Looking at your question again, I see that some of the tags are in uppercase. I will edit my answer to handle those. You can also combine all your HTML files into one file while testing. My code here is not exactly what you wanted. A better way is to loop through the files, get the data you want, and push it onto an array (see Polar Bear's code for a reference on how to do it). – glennmark Mar 26 '20 at 02:25
  • Thanks glennmark for your help and approach shared. – Ger Cas Mar 26 '20 at 20:53