
My goal is to extract the links from the tables titled "Agonists," "Antagonists," and "Allosteric Regulators" in the following site:

http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1

I've been using HTML::TableExtract to extract the tables but have been unable to get HTML::LinkExtor to retrieve the links in question. Here is the code I have so far:

use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;

my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html2/`; 

foreach (@names)
{
    chomp ($_);

    my $te = HTML::TableExtract->new( headers => [ "Ligand",
                                                   "Sp.",
                                                   "Action",
                                                   "Affinity",
                                                   "Units",
                                                   "Reference" ] );
    my $le = HTML::LinkExtor->new();

    $te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");

    my $output = $_;
    $output =~ s/\.html/\.txt/g;
    open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't reset";
    close RESET;
    #open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't append to file";

    foreach my $ts ($te->tables)
    {
        foreach my $row ($ts->rows)
        {
            $le->parse($row->[0]);
            for my $link_tag ( $le->links )
            {
                my %links = @$link_tag;
                print @$link_tag, "\n";
            }
        }
    }
    #print "Links extracted from $_\n";
}

I've tried using some sample code from another thread on this site (Perl parse links from HTML Table) to no avail. I'm not sure whether it's a problem of parsing or table recognition. Any help provided would be greatly appreciated. Thanks!

Wally
  • WWW::Mechanize will take care of the link parsing for you. `my $mech = WWW::Mechanize->new; $mech->get($url); my @links=$mech->links;` The table extraction you'll have to do on your own. – Andy Lester Aug 14 '13 at 20:30
  • @AndyLester - That's one thing that would be useful in Mech, to be able to extract links/inputs only within certain elements, or only before and/or after some element. Sometimes the current select parameters in find_all_links just aren't quite enough. – runrig Aug 14 '13 at 22:42
  • Agreed, but now I am running into a different problem in that the links are coming back as references. As I am a complete novice (I was a biochemist for years previously), I am figuring out how to dereference the array. I've tried adding a backslash ("\") in front of the array and assigning it to another one, but that didn't work, either. Any suggestions? Thanks! – Wally Aug 15 '13 at 15:14
  • @Wally - the links are objects, described in the WWW::Mechanize::Link docs. `$link->url_abs()` to get the full URL. – runrig Aug 15 '13 at 18:32
  • Many thanks for y'all's efforts, but I ended up doing things the caveman way and using regular expressions to extract the links I needed to pair up with something else. The links were unique enough that I could do so, so problem solved! – Wally Aug 16 '13 at 14:09

2 Answers


Try this as a base script; you only need to adapt it to fetch the links:

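use strict;
use warnings;
use HTML::TableExtract;
use HTML::LinkExtor;    # not used yet; needed once you adapt the script to fetch links
use WWW::Mechanize;

use utf8;
# Read and write UTF-8 on the standard handles.
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# autocheck => 1 makes a failed request die instead of failing silently.
my $m = WWW::Mechanize->new( autocheck => 1, quiet => 0 );
$m->agent_alias("Linux Mozilla");
$m->cookie_jar({});

# Select only the tables whose header row contains these column titles.
my $te = HTML::TableExtract->new(
    headers => [
        "Ligand",
        "Sp.",
        "Action",
        "Affinity",
        "Units",
        "Reference"
    ]
);

# decoded_content returns character data rather than the raw response bytes.
$te->parse(
    $m->get("http://tinyurl.com/jvwov9m")->decoded_content
);

foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        # Cells can be undef; print those as empty strings to avoid warnings.
        print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}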
Gilles Quénot

You don't say what the problem is. What exactly doesn't work, and what does $row->[0] contain? Part of the problem may be that HTML::TableExtract returns only the visible text of each cell by default, not the raw HTML, so HTML::LinkExtor has no tags left to find. You probably want the keep_html option in HTML::TableExtract.
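
A minimal sketch of that idea, under stated assumptions: the inline $html sample, its three column titles, and the /ligand/... paths are hypothetical stand-ins for one of the saved pages, and the real markup will differ. It turns on keep_html so each cell keeps its tags, then runs a fresh HTML::LinkExtor over the first cell of every row.

```perl
use strict;
use warnings;
use HTML::TableExtract;
use HTML::LinkExtor;

# Hypothetical sample markup standing in for one saved page.
my $html = <<'HTML';
<table>
  <tr><th>Ligand</th><th>Sp.</th><th>Action</th></tr>
  <tr><td><a href="/ligand/1">alpha</a></td><td>Hs</td><td>Agonist</td></tr>
  <tr><td><a href="/ligand/2">beta</a></td><td>Rn</td><td>Antagonist</td></tr>
</table>
HTML

# keep_html => 1 makes each cell hold its raw HTML instead of only the text.
my $te = HTML::TableExtract->new(
    headers   => [ 'Ligand', 'Sp.', 'Action' ],
    keep_html => 1,
);
$te->parse($html);
$te->eof;

my @hrefs;
for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        my $cell = $row->[0];
        next unless defined $cell;

        # Use a fresh parser per cell: a reused HTML::LinkExtor keeps
        # returning the links it collected from earlier cells as well.
        my $le = HTML::LinkExtor->new;
        $le->parse($cell);
        $le->eof;

        # Each entry is an array ref of the form [ $tag, $attr => $url, ... ].
        for my $link ( $le->links ) {
            my ( $tag, %attr ) = @$link;
            push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
        }
    }
}

print "$_\n" for @hrefs;
```

For the sample above this should print the two href values, one per line. The URLs come back exactly as written in the page; if relative links need resolving, HTML::LinkExtor->new accepts a base URL as its second argument (pass undef as the first if no callback is wanted).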

runrig