
I've attempted to build a program to scrape the web for company management teams. It is very accurate at obtaining many things, including:

- names

- job titles

- images

- emails

- qualifications (MD, PhD, etc.) and suffixes (II, III, Jr.)

The issue I'm running into is scraping the person's description. For instance, on Facebook's Executive Bios page, I would want Mark Zuckerberg's description. However, with all the differences in HTML structure from site to site, it is very difficult to scrape this with close to 100% accuracy.

I am using Perl and many regular expressions that I believe to be fairly advanced. Is there a better way or tool to approach this problem?

My latest attempt was to find the last occurrence of the person's full name on the page, then take all text until I hit a co-worker's name. While this seems like it would work, it gives me less than desirable results.
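For reference, the heuristic above can be sketched in plain Perl roughly as follows. All names and page text here are made-up illustrations, not real data:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sketch of the heuristic: find the last occurrence of the person's full
# name, then capture everything up to the first co-worker's name (if any).
sub extract_description {
    my ( $text, $person, @coworkers ) = @_;

    # Position just after the last occurrence of the person's name.
    my $start = rindex( $text, $person );
    return undef if $start < 0;
    $start += length $person;

    my $rest = substr( $text, $start );

    # Truncate at the earliest co-worker name that appears, if any.
    my $cut = length $rest;
    for my $name (@coworkers) {
        my $pos = index( $rest, $name );
        $cut = $pos if $pos >= 0 && $pos < $cut;
    }

    my $desc = substr( $rest, 0, $cut );
    $desc =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $desc;
}

my $page = "Alice Jones is CEO. Alice Jones founded the company in 1999. "
         . "Bob Smith is CTO and leads engineering.";
print extract_description( $page, 'Alice Jones', 'Bob Smith' ), "\n";
```

As the question notes, this breaks down whenever the description mentions a colleague, or when names appear in navigation and footer text, which helps explain the low hit rate.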

EDIT: I realized this question came off as just trying to parse this specific page. I need something general enough to work on any company's "people page". I know 100% accuracy is unachievable; I'm looking for something that would get me to 50% or more, as I'm currently down around 15-20 percent.

user387049
    Actually, that page is trivial to scrape using any HTML parser. All the information is contained in elements with distinct class names. Of course, using regular expressions to parse HTML is in general an error prone and frustrating task. So, use an HTML parser. – Sinan Ünür Nov 19 '10 at 14:17
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Mark Thomas Nov 19 '10 at 15:22
  • Yes this page is trivial agreed, I need something that is general enough to work on any page (or at least 50-60%). I just grabbed facebook page to show an example of the content I'm going for. – user387049 Nov 19 '10 at 17:47
  • You will never find a regular expression that will be "general enough to work on any page". – Ether Nov 19 '10 at 18:16
    @Ether then is there any other approach that would be general enough to work on 50-60% of pages? – user387049 Nov 19 '10 at 18:20
  • @user: well, you'd have to examine the HTML structure from a sampling of pages, and attempt to find a structure that works for most of them. – Ether Nov 19 '10 at 21:50

2 Answers


Using regular expressions for parsing HTML will certainly fail at one time or another.

A few modules that can help with parsing HTML are HTML::TreeBuilder and WWW::Mechanize.

If you need more control over parsing HTML, you could use HTML::Parser.
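As a minimal illustration of that lower-level control, here is a sketch using HTML::Parser's event handlers to collect the text of every element with a given class. The HTML fragment and class name are made up for the example, not Facebook's actual markup:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Parser;

my $wanted_class = 'biodescription';
my ( $inside, @descriptions );

# Event-driven parsing: flip a flag when we enter a matching element,
# collect text while the flag is set, clear it on any end tag.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ( $tag, $attr ) = @_;
        $inside = 1 if ( $attr->{class} // '' ) eq $wanted_class;
    }, 'tagname, attr' ],
    text_h => [ sub {
        push @descriptions, $_[0] if $inside;
    }, 'dtext' ],
    end_h => [ sub { $inside = 0 }, '' ],
);

$parser->parse(<<'HTML');
<div class="biosummary">
  <span class="bioname">Jane Doe</span>
  <p class="biodescription">Jane leads product strategy.</p>
</div>
HTML
$parser->eof;

print "$_\n" for @descriptions;
```

This is more work than HTML::TreeBuilder for simple lookups, but the handler approach is useful when you need to stream large pages or react to tags as they arrive.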

Furthermore, there have been several questions on parsing HTML using Perl on Stack Overflow. The answers there can be helpful.

A sample scraper for the Facebook Executive Bios page, which makes use of LWP::UserAgent to fetch page content and HTML::TreeBuilder for parsing:

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;

binmode STDOUT, ':utf8';

my $ua = LWP::UserAgent->new( 'agent' => 'Mozilla' );
my $response = $ua->get('http://www.facebook.com/press/info.php?execbios');

my $tree = HTML::TreeBuilder->new();
if ( $response->is_success() ) {
    $tree->parse_content( $response->decoded_content() );
}
else {
    die $response->status_line();
}

for my $biosummary_tag ( $tree->look_down( 'class' => 'biosummary' ) ) {
    my $bioname_tag  = $biosummary_tag->look_down( 'class' => 'bioname'  );
    my $biotitle_tag = $biosummary_tag->look_down( 'class' => 'biotitle' );
    my $biodescription_tag
      = $biosummary_tag->look_down( 'class' => 'biodescription' );

    my $bioname        = $bioname_tag->as_text();
    my $biotitle       = $biotitle_tag->as_text();
    my $biodescription = $biodescription_tag->as_text();

    print "Name:        $bioname\n";
    print "Title:       $biotitle\n";
    print "Description: $biodescription\n\n";
}
Alan Haggai Alavi
  • What's here the difference between "( $p_tag->content_list() )[0]" and "$p_tag->as_text" ? – sid_com Nov 19 '10 at 15:22
  • sid_com: `content_list()` returns child nodes, whereas `as_text()` returns the text within child nodes. Clearly, `as_text()` is the method that should be used in this case. I have updated my answer. Thank you for noting it. – Alan Haggai Alavi Nov 19 '10 at 16:18
    WWW::Mechanize will not help with parsing HTML content, other than links and images. – Andy Lester Nov 19 '10 at 17:45
  • this is great for this specific url, but I need something more general that can work on almost any website. Obviously I can never achieve 100%, but even some approach that could get me 50 or 60% descriptions correctly would be amazing. – user387049 Nov 19 '10 at 17:46
  • user387049: I hope that you will be able to learn from the above example and write your own scraper for your specific URLs. Otherwise, let us know what you need help with. – Alan Haggai Alavi Nov 20 '10 at 01:36

You are never going to get 100%, at least not with today's technology.

The most reliable way is to have markup in the source, but as you are web scraping you don't have this. Rather than regexes, you could try some more sophisticated Natural Language Processing (NLP) techniques. I don't know what is available for Perl, but Python's NLTK is good for getting started. It is a toolkit designed so you can pick and choose what you need to extract the information you want, and there are a couple of good books out there, including the open-sourced O'Reilly book Natural Language Processing with Python.
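Even without a full NLP toolkit, a crude statistical stand-in for this idea can be written in plain Perl: among the candidate text blocks on a page, score them by simple features, such as length and whether they mention the person's name, and pick the best. Everything below is a made-up illustration of that scoring idea, not a real extraction pipeline:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Crude stand-in for a smarter NLP approach: among candidate text blocks,
# return the longest one that mentions the person's name. The assumption
# (which real NLP would refine) is that a bio is long and name-bearing.
sub best_candidate {
    my ( $name, @blocks ) = @_;
    my @hits = grep { index( $_, $name ) >= 0 } @blocks;
    my ($best) = sort { length $b <=> length $a } @hits;
    return $best;
}

my @blocks = (
    'Jane Doe - CEO',
    'Jane Doe co-founded the company and has led it since 2005.',
    'Contact us at press@example.com',
);
print best_candidate( 'Jane Doe', @blocks ), "\n";
```

A real NLP pipeline would replace the length heuristic with features like sentence structure, pronoun resolution, and named-entity recognition, which is where a toolkit like NLTK earns its keep.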

daxim
winwaed