Printing all HTML Tables with certain string for multiple files (perl)

Question

I am trying to print every HTML table containing the string "kcat" from each XML file in a directory, but I am having some trouble. Note that each file in the directory (named kcat_tables) has at least one HTML table with kcat in it. I am running this program on an Ubuntu virtual machine. Here is my code:

#!/usr/bin/perl
use warnings;
use strict;
use File::Slurp;
use Path::Iterator::Rule;
use HTML::TableExtract;
use utf8::all;
my @papers_dir_path = qw(/home/bob/kinase/kcat_tables);

my $rule = Path::Iterator::Rule->new;
$rule->name("*.nxml");
$rule->skip_dirs(".");

my $xml;
my $it = $rule->iter(@papers_dir_path);

while ( my $file = $it->() ) {
    $xml = read_file($file);
    my $te = HTML::TableExtract->new();
    $te->parse($xml);
    foreach my $ts ( $te->tables ) {
        if ( $ts =~ /kcat/i ) {
            print "Table (", join( ',', $ts->coords ), "):\n";
            foreach my $row ( $ts->rows ) {
                print join( ',', @$row ), "\n";
            }
        }
    }
}

Any ideas on how I should fix this? Thanks in advance! Also, I am fairly new to Perl, so a simple, comprehensible answer would be very much appreciated.

What is your exact problem? Do you get any errors? Or is the output different from your expected result? Show the input, the output, and the desired outcome too; then there is much more hope of helping you. — w.k, Feb 14 '15 at 02:44
When I run my code I get the following error: Use of uninitialized value in join or string at ./table_parser.pl line 39. Also, when the program does output something, it is in a very raw form and I can't really discern the table. In other words, how can I get rid of that error and make the output look more like a table? — alphasugar, Feb 14 '15 at 02:54
Reflowed your script, but it's not 39 lines long. (I would recommend getting hold of perltidy. It makes formatting your code nicely much easier.) — Sobrique, Feb 14 '15 at 11:55
Can you also give an example of your source data? It makes it easier to grok. — Sobrique, Feb 14 '15 at 12:12
Is table_parser.pl *your* file, or is it a file from HTML/TableExtract/ ? I see something very fishy: `if( $ts =~ /kcat/i )`. If `$ts` is an object, it makes no sense to run it against a regular expression. Regular expressions are for strings, not objects (unless `=~` is somehow overloaded, but I can't find anything about that in the documentation for HTML::TableExtract). — mareoraft, Feb 15 '15 at 02:40
Yes, table_parser.pl is my file. I copied my code here and deleted my comments, so sorry for the line 39 thing, but the join on line 39 is the following: `print join( ',', @$row ), "\n";`. Here is some sample source data: http://pastebin.com/bLauAYK3. Regarding mareoraft's point, I agree that I can't use a regex on objects, but how do I fix the code so that $ts holds the raw HTML code for the tables? Again, is there also a way for me to format the output as a table? If not, would I be able to simply open the file in a web browser, which automatically renders the code? — alphasugar, Feb 15 '15 at 16:12
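
The "Use of uninitialized value in join or string" warning reported above suggests that some cells in `@$row` are undef. The HTML::TableExtract documentation notes that tables using rowspan or colspan attributes will have some cells containing undef, so that is a likely cause, though it is not confirmed in this thread. A minimal sketch of guarding the join, where `row_to_line` is a hypothetical helper name and the sample row is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: join one row's cells, replacing undef cells
# (assumed here to come from rowspan/colspan gaps) with empty strings.
sub row_to_line {
    my ($row) = @_;
    return join( ',', map { defined $_ ? $_ : '' } @$row );
}

# Stand-in for one row returned by $ts->rows; the undef mimics a spanned cell.
my @row = ( 'Enzyme', undef, 'kcat (s-1)' );
print row_to_line( \@row ), "\n";
```

In the original loop, this would replace the bare `join( ',', @$row )` call.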

Tom · Answer 1 · 2015-02-18T09:40:53.067

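You cannot usefully apply a regex to an object: the match runs against the object's stringified reference, not the table's content. That is the problem in: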

if ( $ts =~ /kcat/i ) {
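
To illustrate why this test fails quietly instead of raising an error, here is a small sketch using a hypothetical stand-in class, `My::Table`, which is not part of HTML::TableExtract. It assumes the class does not overload stringification:

```perl
use strict;
use warnings;

# Hypothetical stand-in class; it does not overload stringification.
package My::Table;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

package main;

my $table = My::Table->new( text => 'kcat = 12 s-1' );

# In string context the object becomes something like "My::Table=HASH(0x...)".
my $as_string = "$table";
print "Stringified: $as_string\n";

# The regex only sees that string, so it never finds the table text.
print "Object matches kcat: ",  ( $table =~ /kcat/i         ? 'yes' : 'no' ), "\n";
print "Content matches kcat: ", ( $table->{text} =~ /kcat/i ? 'yes' : 'no' ), "\n";
```

The match has to be run against text extracted from the table, not against the table object itself.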

I'd suggest parsing the tables in 'tree' mode. For this, you have to install two additional Perl modules: HTML::TreeBuilder and HTML::ElementTable. Enable it like this:

use HTML::TableExtract 'tree';

Here's the fixed while loop:

while ( my $file = $it->() ) {
  $xml = read_file($file);
  my $te = HTML::TableExtract->new();
  $te->parse($xml);
  foreach my $ts ( $te->tables ) {
    my $tree = $ts->tree or die $!;
    if ( $tree->as_text =~ /kcat/i ) {
      print "Table (", join( ',', $ts->coords ), "):\n";
      # update 18.2.2015: pretty print the table
      foreach my $row ($ts->rows) {
        print join ' | ', map {sprintf "%22s", $_->as_text} @{$row};
        print "\n";
        # which is the same as
        # foreach my $cell (@{$row}) { do something with $cell->as_text }
      }
    }
  }
}

$tree is an HTML::ElementTable object. The code above works with your sample.
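
The fixed width of 22 in the `sprintf` above leaves columns misaligned whenever a cell is longer than 22 characters. As an alternative, here is a sketch that sizes each column to its longest cell. It assumes the cell text has already been extracted into plain strings; `format_rows` is a hypothetical helper and the sample data is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: format rows (array refs of plain strings) as
# left-aligned columns, each as wide as its longest cell.
sub format_rows {
    my (@rows) = @_;

    # First pass: find the widest cell in each column.
    my @width;
    for my $row (@rows) {
        for my $i ( 0 .. $#$row ) {
            my $len = length( $row->[$i] // '' );
            $width[$i] = $len if !defined $width[$i] || $len > $width[$i];
        }
    }

    # Second pass: pad every cell to its column width.
    return map {
        my $row = $_;
        join( ' | ',
            map { sprintf '%-*s', $width[$_], $row->[$_] // '' } 0 .. $#$row )
    } @rows;
}

print "$_\n" for format_rows(
    [ 'Enzyme', 'kcat' ],
    [ 'PKA',    '20' ],
);
```

The `//` operator requires Perl 5.10 or later.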

edited Feb 18 '15 at 09:40

answered Feb 16 '15 at 16:38


I have imported the following into my program: `use HTML::TreeBuilder; use HTML::ElementTable; use HTML::TableExtract 'tree';` My while loop is the same as yours except that I have added: `my $tree = HTML::ElementTable->new();`. If I don't include that line, the program gives me the following error: Global symbol "$tree" requires explicit package name. If I do include it, I get the following error: Can't locate object method "ElementTable=HASH(0x1aebc70)" via package "HTML" (it is talking about this line: `$tree = $ts->$tree or die $!;`). What is wrong? – alphasugar Feb 16 '15 at 18:12
Sorry. It actually did work! Thank you! Do you know of any way I can print it out in a nicer format (like in table format)? Currently, it is just a block of text. – alphasugar Feb 18 '15 at 00:24
I updated the sample code in my answer above; it now prints the rows as a table on the console. – Tom Feb 18 '15 at 09:42
I made the change but got the following error: Can't call method "as_text" on unblessed reference at ./table_parser.pl line 34. Line 34 is referring to `print join ' | ', map {sprintf "%22s", $_->as_text} @{$row};` Any ideas? Thanks for all your help! – alphasugar Feb 18 '15 at 23:33
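
The cause of this last error is not established in the thread. The message itself indicates that at least one cell in `@{$row}` is a reference that is not an object, so calling `as_text` on every cell unconditionally is unsafe. A defensive sketch, assuming cells may be undef, element objects, scalar references, or plain strings; `cell_text` and the `My::Cell` stand-in class are hypothetical:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: return display text for a cell of any of the
# assumed shapes, without ever calling a method on a non-object.
sub cell_text {
    my ($cell) = @_;
    return '' unless defined $cell;                # empty or spanned cell
    return $cell->as_text
        if blessed($cell) && $cell->can('as_text');    # element object
    return ${$cell} // '' if ref($cell) eq 'SCALAR';   # reference to a string
    return "$cell";                                # plain string (or other ref)
}

# Hypothetical stand-in for an element object with an as_text method.
{
    package My::Cell;
    sub new     { my ( $class, $text ) = @_; return bless { text => $text }, $class }
    sub as_text { return $_[0]{text} }
}

my @row = ( My::Cell->new('kcat'), undef, \'12.5', 'plain' );
print join( ' | ', map { sprintf '%10s', cell_text($_) } @row ), "\n";
```

In the answer's loop, `$_->as_text` inside the `map` would become `cell_text($_)`.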
