
I have an ID code that represents a protein. There is a website called InterPro that provides protein-related information, and the URL for a protein's page contains that particular code. By changing the code in the URL I can get information about any protein. I developed a script in Perl to get the information directly from the web. I used the following code:

    use LWP::Simple qw(getstore);   # provides getstore()

    my $uniprot = "P15700";
    my $url     = "http://wwwdev.ebi.ac.uk/interpro/ISearch?query=$uniprot+";
    my $file    = "$uniprot";
    my $resp    = getstore( $url, $file );



In this example P15700 is the unique ID of the protein and the URL is http://wwwdev.ebi.ac.uk/interpro/ISearch?query=P15700+. This retrieves the whole HTML page, but I need one particular piece of information on that page, under the heading "Protein family membership". In this example, if you open the link you can find "Adenylate kinase" written under that heading. I just need that information in an output text file displaying a table in which one column gives the ID and the other gives the information under "Protein family membership". I am new to Perl and I don't have a computer science background; I am a biologist. So, I want to know whether the above task can be done using Perl. If yes, how? I'll be grateful if anyone can solve this problem.

Shipra
  • Why would you want to parse HTML when you've got countless web services at hand? Have a look at the [dbfetch](http://www.ebi.ac.uk/Tools/dbfetch/) tool ([example](http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=P15700&format=annot&style=default&Retrieve=Retrieve)) - And if it is "just" InterPro there are already [Perl clients](http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest). And then there's [BioPerl](https://metacpan.org/module/BioPerl) - especially [Bio::Index::EMBL](https://metacpan.org/module/Bio::Index::EMBL). – Sebastian Stumpf May 18 '12 at 16:23
  • Sebastian Stumpf, make that into an answer with actual code examples that fulfil the question's requirement and have one upvote from me guaranteed. – daxim May 18 '12 at 17:13
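
A minimal sketch of that web-service route, assuming dbfetch's plain flat-file output (style=raw) and that the InterPro cross-references appear on "DR   InterPro;" lines of the UniProtKB annotation:

    # A sketch only: uses dbfetch instead of scraping HTML.
    # Assumes style=raw returns the plain UniProtKB flat file, where
    # family membership shows up on "DR   InterPro; ..." lines.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $uniprot = 'P15700';
    my $url     = "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch"
                . "?db=uniprotkb&id=$uniprot&format=annot&style=raw";
    my $annot   = get($url) or die "no response for $uniprot";

    # One output row per InterPro cross-reference: ID, entry, short name.
    for my $line ( split /\n/, $annot ) {
        if ( $line =~ /^DR\s+InterPro;\s*(\S+);\s*(.+?)\.?\s*$/ ) {
            print join( "\t", $uniprot, $1, $2 ), "\n";
        }
    }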

4 Answers

    use strictures;
    use Web::Query 'wq';

    my $w = wq 'http://wwwdev.ebi.ac.uk/interpro/ISearch?query=P15700+';
    $w->find('.prot_fam a')->text;
    # in list context the expression returns:
    # (
    #     'Adenylate kinase',
    #     'UMP-CMP kinase',
    # )
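
A sketch of how that snippet could grow into the two-column table the question asks for; the accession list and output filename below are made up:

    use strictures;
    use Web::Query 'wq';

    my @ids = qw(P15700 P69441);    # example accessions only
    open my $out, '>', 'families.txt' or die "families.txt: $!";
    for my $id (@ids) {
        # ->text in list context gives every family name under .prot_fam
        my @family = wq("http://wwwdev.ebi.ac.uk/interpro/ISearch?query=$id+")
            ->find('.prot_fam a')->text;
        print {$out} join( "\t", $id, @family ), "\n";
    }
    close $out;
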
daxim
  • I am really new to all this. Would you mind telling me what is strictures for? – Shipra May 21 '12 at 00:42
  • [strictures](http://p3rl.org/strictures), [Why use strict and warnings?](http://stackoverflow.com/questions/8023959/why-use-strict-and-warnings), [Use strict and warnings](http://www.perlmonks.org/?node_id=111088) – daxim May 21 '12 at 06:46

This relates to parsing web-page HTML, which IMO is rarely a good idea: the page may change at any time, and that will cause your script to stop working properly. If you are still interested, here's the solution:

    use Mojo::DOM;

    # $resp holds the HTML fetched by the code in the question
    my $dom  = Mojo::DOM->new($resp);
    my $name = $dom->find('div.prot_fam a')->[0]->text;

Now the $name variable will hold the string "Adenylate kinase".
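
Since the snippet reuses $resp from the question's code, a self-contained sketch might fold the fetch in with LWP::Simple (the die message here is made up):

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use Mojo::DOM;

    my $uniprot = 'P15700';
    my $resp    = get("http://wwwdev.ebi.ac.uk/interpro/ISearch?query=$uniprot+")
        or die "no response for $uniprot";

    # First link under the prot_fam div is the family name.
    my $dom  = Mojo::DOM->new($resp);
    my $name = $dom->find('div.prot_fam a')->[0]->text;
    print "$uniprot;$name\n";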

nab
  • Woohoo... that's really awesome - 3 lines - respect! But 5 MB for a module I've never heard of? – int2000 May 18 '12 at 16:08
  • You said "This relates to parsing web page HTML using Perl which IMO is rarely a good idea". I think you probably meant "This relates to parsing web page HTML which IMO is rarely a good idea". There's nothing about screen-scraping in Perl that makes it any less of a good idea than it would be in any other language. IMO :) – Dave Cross May 18 '12 at 16:25

Everything can be done using Perl :) As for this particular problem, take a look at this question of mine concerning recursive web download and DOM code.

As you're not a programmer, much of this will be news to you.

Let's understand the DOM first. It's the HTML tree built in the browser when viewing web pages. You can acquire a decent understanding of the DOM by playing around with Firebug, or the equivalent plug-ins or built-ins for Chrome, IE, and Opera, whichever one you're using.

So you will have to go to this page and analyze its DOM. It looks like the info you're looking for is in a `<div class="prot_fam">` element. So that's all the info you need to write the code:

    D:\ :: more /t2 prot.pl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TreeBuilder::XPath;

    my $url  = shift || die 'pass URL as argument!';
    my $file = shift || die 'pass output filename as argument!';

    my $ua  = LWP::UserAgent->new;
    my $rsp = $ua->mirror( $url, $file );
    if ( ! $rsp->is_success ) {
        die $rsp->status_line;
    }

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file( $file ) or die;

    print $_, "\n" for map $_->as_XML_indented,
        $tree->findnodes(q( //div[@class='prot_fam'] ));

    D:\ :: perl prot.pl http://wwwdev.ebi.ac.uk/interpro/ISearch?query=P15700 P15700.html
    <div class="prot_fam">
      <div class="entry-parent">
        <div class="entry-parent">
          <a href="IEntrySummary?ac=IPR000850&amp;query=P15700">Adenylate kinase</a>
          <div class="entry-child-prot">
            <div class="entry-parent">
              <a href="IEntrySummary?ac=IPR006266&amp;query=P15700">UMP-CMP kinase</a>
            </div>
          </div>
        </div>
      </div>
    </div>

Here's another sample using Mojo::DOM:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Mojo::DOM;

    my $url = shift || die 'URL!';
    my $ua  = LWP::UserAgent->new;
    my $rsp = $ua->get( $url );
    die $rsp->status_line unless $rsp->is_success;

    my $dom = Mojo::DOM->new( $rsp->content );
    for ( $dom->find('div[class="prot_fam"]')->each ) {
        print $_->find('a')->join("\n"), "\n";
    }
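
It runs the same way as the first script, with the URL as the argument (prot2.pl is just an assumed filename):

    D:\ :: perl prot2.pl http://wwwdev.ebi.ac.uk/interpro/ISearch?query=P15700
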
Lumi

Not even sexy, but it works (based on the HTML::TreeBuilder module): you have to parse the HTML and extract the information. In this example the result is stored as CSV in the file "result.txt".

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $uniprot = "P15700";
    my $url     = "http://wwwdev.ebi.ac.uk/interpro/ISearch?query=$uniprot+";
    my $resp    = get( $url );

    # Walk down the tree to the first family link under prot_fam.
    my $tree  = HTML::TreeBuilder->new_from_content($resp);
    my $first = $tree->look_down( _tag => 'div', class => 'prot_fam' );
    $first = $first->look_down( _tag => 'div', class => 'entry-parent' );
    $first = $first->look_down( _tag => 'div', class => 'entry-parent' );
    $first = $first->look_down( _tag => 'a' );

    open( FH, ">>result.txt" ) or die "result.txt: $!";
    print FH $uniprot . ";";
    print FH $first->content_list;
    print FH "\n";
    close(FH);

Edit: Here's a variant for checking lots of "uniprots". Play around with the sleep delay.

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my @ports = qw(Q9H4B7 Q96RI1 P04150 P35354 P23219 P61073 P0A3M6 Q8DR59
                   Q7CRA4 Q27738 P35367 P35367 P35367 P08172 P35367 P10275
                   P25021 P07550 P08588 P13945);

    for my $port (@ports) {
        my $url  = "http://wwwdev.ebi.ac.uk/interpro/ISearch?query=" . $port . "+";
        my $resp = get( $url );

        my $tree  = HTML::TreeBuilder->new_from_content($resp);
        my $first = $tree->look_down( _tag => 'div', class => 'prot_fam' );
        $first = $first->look_down( _tag => 'div', class => 'entry-parent' );
        $first = $first->look_down( _tag => 'div', class => 'entry-parent' );
        $first = $first->look_down( _tag => 'a' );

        open( FH, ">>result.txt" ) or die "result.txt: $!";
        print FH $port . ";";
        print FH $first->content_list;
        print FH "\n";
        close(FH);
        sleep 10;
    }
int2000
  • I used the above-mentioned script for a list of a few codes. The list was in another text file and I slurped it in. It worked perfectly for a few codes, but when I used a bigger list it stopped working. The error message was **Can't call method "look_down"** at the line `$first=$first->look_down(_tag => 'div',class => 'entry-parent');`. I can't figure out where the problem is. Please help. – Shipra May 21 '12 at 09:40
  • Please give me an example for the list / query. – int2000 May 21 '12 at 10:25
  • Q9H4B7 Q96RI1 P04150 P35354 P23219 P61073 P0A3M6 Q8DR59 Q7CRA4 Q27738 P35367 P35367 P35367 P08172 P35367 P10275 P25021 P07550 P08588 P13945 – Shipra May 21 '12 at 11:01
  • it is a list in a text file that I slurp in at the place of $uniprot – Shipra May 21 '12 at 11:04
  • Looks like the site has problems. Even in Chrome I got an error message when I tried to enter the URL manually. I think it's a problem with the page. When you put some delay between the requests it's a little bit better (but still not good :( ) – int2000 May 21 '12 at 11:18
  • and now I think the website is OK – Shipra May 21 '12 at 12:39
  • Looks like the site has some kind of "anti-crawler" feature. There's no further workaround. You can catch the exception (Google "Perl try catch") and repeat the search if an exception happens. Or you can take Sebastian Stumpf's advice (first comment on your question) and use another service/module. – int2000 May 21 '12 at 12:50
  • thanks anyway for the advice :) it helped me in learning new things – Shipra May 21 '12 at 13:31
  • This code works, but it stops where the tree changes a bit. I want it to keep going: if it finds a tree that doesn't match, it should skip it and move on to the next code. Can this be solved? Please help – Shipra May 23 '12 at 11:03
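
A minimal sketch of the skip-and-continue behaviour asked for in that last comment, under the same assumptions as the answer above: check each lookup before using it and move on to the next ID when the tree doesn't match (the warning messages are made up):

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TreeBuilder;

    my @ports = qw(Q9H4B7 Q96RI1 P04150);    # trimmed example list

    for my $port (@ports) {
        my $resp = get("http://wwwdev.ebi.ac.uk/interpro/ISearch?query=$port+");
        unless ( defined $resp ) {
            warn "no response for $port, skipping\n";
            next;
        }

        my $tree  = HTML::TreeBuilder->new_from_content($resp);
        my $first = $tree->look_down( _tag => 'div', class => 'prot_fam' );
        $first &&= $first->look_down( _tag => 'a' );    # first family link
        unless ($first) {
            warn "no prot_fam entry for $port, skipping\n";
            next;
        }

        open( FH, ">>result.txt" ) or die "result.txt: $!";
        print FH $port . ";";
        print FH $first->content_list;
        print FH "\n";
        close(FH);
        sleep 10;
    }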