Get value from HTML table with PERL

Question

I am trying to get values from specific td cells of an existing HTML table. Can anyone help me with it?

The existing table's code is as below.

<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">apushkin@mail.ru</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>

I tried the Perl script below, but it does not work.

$pp = get("http://www.domain.com/something_something");
$out[0]="/home/.../public_html/perl_output.txt";
($firstname) = ($str =~ /<td id="firstname" class="value">(.+?)<\/firstname/);
($surname) = ($str =~ /<td id="surname" class="value">(.+?)<\/surname/);
($email) = ($str =~ /<td id="email" class="value">(.+?)<\/email/);
($telephone) = ($str =~ /<td id="telephone" class="value">(.+?)<\/telephone/);

print "First Name: $firstname \n";
print "Last Name: $surname \n";
print "Email: $email \n";
print "Telephone: $telephone \n";

exit;

Can anyone guide me?

score 4 · Answer 1 · edited May 23 '17 at 11:45

This answer solves the problem described in the question, but not the actual problem OP has revealed in the comments.

Because Web::Scraper is for HTML documents, this is not going to work with the website that OP wants to scrape. It uses XML. See my other answer for a solution that deals with XML.

Don't try to parse HTML with regular expressions! Use an HTML parser instead.

For web scraping I prefer Web::Scraper. It does everything from fetching the page to parsing the content in a very simple DSL.

use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dumper;

my $people = scraper {
    # this will parse all tables and put the results into the key people
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT'; # grab those ids
        process '#surname',   last_name  => 'TEXT'; # and put them into
        process '#email',     email      => 'TEXT'; # a hashref with the
        process '#telephone', phone      => 'TEXT'; # 2nd arg as key
    };
    result 'people'; # only return the people key
};
my $res = $people->scrape( URI->new("http://www.domain.com/something_something") );

print Dumper $res;

__DATA__
$VAR1 = [
  {
    first_name => 'ALEXANDR',
    last_name => 'PUSHKIN',
    email => 'apushkin@mail.ru',
    phone => '+991122334455',
  }
]
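A way to try the scraper without a live URL is to hand it the markup as a string. The following is a minimal sketch, assuming Web::Scraper is installed: it inlines the table from the question and passes a scalar reference to `scrape`, which makes the module parse that string as HTML instead of fetching anything. The selector for the last name is `#surname`, because that is the id used in the question's table.

```perl
use strict;
use warnings;
use Web::Scraper;

# The table from the question, inlined so no network access is needed.
my $html = <<'HTML';
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">apushkin@mail.ru</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>
HTML

my $people = scraper {
    # one hashref per table, collected under the key "people"
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT';
        process '#surname',   last_name  => 'TEXT';
        process '#email',     email      => 'TEXT';
        process '#telephone', phone      => 'TEXT';
    };
    result 'people';
};

# A scalar reference is treated as HTML content, not as a URL.
my $res = $people->scrape( \$html );

for my $person (@$res) {
    print "First Name: $person->{first_name}\n";
    print "Last Name: $person->{last_name}\n";
    print "Email: $person->{email}\n";
    print "Telephone: $person->{phone}\n";
}
```

The same scraper object can later be pointed at the real page by passing a URI object instead of the scalar reference.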

If one of the fields, like email or firstname, occurs multiple times in one table, you can use an array reference for that. In that case the document's HTML would not be valid, though, because of the duplicate ids. Use a different selector and pray it works.

 process '#email', 'email[]' => 'TEXT';

Now you'll get this kind of structure:

{
  email => [
   'foo@example.org',
   'bar@example.org',
  ],
}

answered Feb 18 '16 at 14:00

simbabque

Note: I haven't run this code because there was no real URL supplied and Web::Scraper doesn't play well with `__DATA__`. – simbabque Feb 18 '16 at 14:02
Many thanks. How would the code look if there is more than one email address and telephone number? A foreach loop would have to be included somehow, wouldn't it? – Feb 18 '16 at 15:05

Give us an example of the HTML including multiple values. – Dave Cross Feb 18 '16 at 15:14
@esqeudero: Yes, we would need example data. It depends if it's normalized. – simbabque Feb 18 '16 at 15:15
For example, I want to get the values for each published paper (article) from the existing metadata at this link (http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc). I need only these values, but there may be more than one author: #dc_title #dc_author #dc_affiliation #dc_email #dc_jel #dc_keywords #dc_description #dc_format #dc_source #dc_year #dc_volume #dc_issue #dc_pages #dc_pdfurl – Feb 18 '16 at 15:22
@esqeudero In that case you have one table per record, and there are multiple author fields and multiple email fields, but no way to match the mail to the name. As it happens, that document is not valid, as it uses the same `id` multiple times. You would have to tune the selectors (the argument after `process`, such as `'#email'`) a bit to match the right stuff. – simbabque Feb 18 '16 at 15:25
Updated the answer. I admit, that thing is a bit hard to parse. Fun how they use XML namespaces, but cannot get the `id` right. – simbabque Feb 18 '16 at 15:29
@esqeudero Oh. This is actually not HTML, but XML. Meh. Next time, please directly post the real thing, or read properly. There is no HTML document, it's XML and there is an XSLT that the browser uses to render it into HTML. In that case, we will need a different approach. – simbabque Feb 18 '16 at 15:36

score 1 · Accepted Answer · edited May 23 '17 at 12:31

Since it turned out that the document is actually XML, here is a solution that uses an XML parser to deal with it, and also takes into account multiple fields. XML::Twig is very useful for this, and it even lets us download the document.

use strict;
use warnings;
use XML::Twig;
use Data::Printer;

my @docs; # we will save the docs here
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;

            my $foo = {
                # grab all elements of type 'dc:author' inside our
                # element and call text_only on them
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };

            push @docs, $foo;
        }
    }
);

$twig->parseurl("http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc");

p @docs;

__END__

[
    [0]  {
        author   [
            [0] "Nazila Isgandarova"
        ],
        email    [
            [0] "azerwomensc@yahoo.ca"
        ]
    },
    [1]  {
        author   [
            [0] "Mette Nordahl Grosen",
            [1] "Bezen Balamir Coskun"
        ],
        email    [
            [0] "m.grosen@gmail.com",
            [1] "bezenbalamir@gmail.com"
        ]
    },
# ...
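For experimenting without the remote server, the same handler can be run against an XML string with `parse` instead of `parseurl`. This is only a sketch, assuming XML::Twig is installed. The sample record is hypothetical: the element names follow the handler above, but the values and the surrounding `records`/`record` nesting are made up for illustration and are not copied from the real feed. It also picks up a single-valued field (`dc:title`) with `first_child_text`.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample record; values and nesting are assumptions for illustration.
my $xml = <<'XML';
<records>
  <record>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Paper</dc:title>
      <dc:author>First Author</dc:author>
      <dc:email>first@example.org</dc:email>
      <dc:author>Second Author</dc:author>
      <dc:email>second@example.org</dc:email>
    </oai_dc:dc>
  </record>
</records>
XML

my @docs;
my $twig = XML::Twig->new(
    twig_handlers => {
        # called once for every complete oai_dc:dc element
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;
            push @docs, {
                # single-valued field: text of the first matching child
                title  => $elt->first_child_text('dc:title'),
                # repeatable fields: one array entry per element
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email  => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };
        },
    },
);

$twig->parse($xml);

for my $doc (@docs) {
    print "Title: $doc->{title}\n";
    print "Author: $_\n" for @{ $doc->{author} };
    print "Email: $_\n"  for @{ $doc->{email} };
}
```

Authors and emails end up in two separate arrays in document order, so pairing them by index is only as reliable as the source data.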

Now I stole my own accepted answer. That's a first. :D – simbabque Feb 18 '16 at 16:33

score 0 · Answer 3 · answered Feb 18 '16 at 17:05

First, you really should use an XML parser.

Now to some possible reasons why the code does not work:

Your regular expressions expect an ending tag, e.g. `</firstname`, which does not exist in your HTML.
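A minimal sketch of that fix, keeping the question's own regex approach: the pattern closes on `</td>`, the tag that really ends the cell. The question's script also stores the page in `$pp` but matches against `$str`, so the sketch uses a single variable. One row of the table is inlined in place of the `get` call purely so the snippet runs on its own.

```perl
use strict;
use warnings;

# One row from the question's table, inlined instead of fetching the page.
my $pp = '<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>';

# Match against the variable that actually holds the page ($pp, not $str)
# and stop at </td> rather than a non-existent </firstname tag.
my ($firstname) = $pp =~ /<td id="firstname" class="value">(.+?)<\/td>/;

print "First Name: $firstname\n";
```

This still depends on the exact attribute order and quoting in the markup, which is why a parser is the safer choice.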

If the HTML is plain and reliable and you really want a regex, it would be better written like this:

m/<td    
  [^>]+    # anything but '>'
  id="firstname"
  [^>]+    # anything but '>'
  >
  ([^<]+?) # anything but '<'
  <
/xms;

This does not take into account the case insensitivity of HTML, the decoding of HTML entities, or other allowed quote characters.
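As a rough illustration, the pattern can be wrapped in a small helper and applied to each id in the question's table. The helper name `td_value` is hypothetical, and the sketch inherits all the limitations listed above. Like the pattern it is built on, it also requires at least one more attribute after the `id`, which holds for the question's markup.

```perl
use strict;
use warnings;

# The table from the question.
my $html = <<'HTML';
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">apushkin@mail.ru</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>
HTML

# Hypothetical helper: returns the text of the first <td> with the given id,
# or undef when no cell matches.
sub td_value {
    my ($markup, $id) = @_;
    my ($value) = $markup =~ m/<td
        [^>]+          # anything but '>'
        id="\Q$id\E"   # the id, with regex metacharacters escaped
        [^>]+          # anything but '>'
        >
        ([^<]+?)       # anything but '<'
        <
    /xms;
    return $value;
}

for my $id (qw(firstname surname email telephone)) {
    my $value = td_value($html, $id);
    printf "%-10s %s\n", $id, defined $value ? $value : '(not found)';
}
```

Returning undef for a missing id lets the caller tell "cell not present" apart from any real cell text.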
