Perl regular expression for html

Question

I need to extract the IMDB id (for example, tt0416449 for the movie 300) for a movie specified by the variable URL. I have looked at the page source for this page and come up with the following regex:

use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";

if (is_success( $content = LWP::Simple::get($url) ) ) {
    print "$url is alive!\n";
} else {
    print "No movies found";
}

$code = "";

if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
    $code = $1;
}

I am getting an internal server error at this line

$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s

I am very new to Perl, and would be grateful if anyone could point out my mistake(s).

I was doing some web scraping recently and found that the html sent to my browser was subtly different from that sent to my program (because of different responses generated for different user agent types). Did you examine the html in your browser? — Sheena, Oct 23 '12 at 05:43
What about using their API? See: http://www.omdbapi.com/ This way you could reduce your parsing effort to a minimum. — Nippey, Oct 23 '12 at 05:46
@Sheena Yes the html sent to my program looked the same as the source. — Kartik, Oct 23 '12 at 07:25
@Nippey Sadly I cannot use an API as it is a part of an assignment which instructs me not to use an API. — Kartik, Oct 23 '12 at 07:26

score 12 · Accepted Answer · edited May 23 '17 at 12:27

Use an HTML parser. Regular expressions cannot reliably parse arbitrary HTML.

Anyway, the reason for the error is probably that you forgot to escape a forward slash in your regex: the unescaped / in </td> ends the pattern early, so Perl cannot compile what follows. It should look like this:

/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
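Note that the corrected pattern still has no capture group, so $1 would stay empty even after a successful match. A minimal sketch of one way around both problems, assuming the markup really looks like the question describes: use m{...} delimiters so slashes need no escaping, and put parentheses around the id. The helper name first_imdb_id and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the id from the first result row, or undef.
sub first_imdb_id {
    my ($content) = @_;
    return unless defined $content;

    # m{...} delimiters mean the slashes in </td> and /title/ need no escaping.
    # The parentheses capture the id into $1.
    if ($content =~ m{<td class="number">1\.</td><td class="image"><a href="/title/(tt\d{1,7})}s) {
        return $1;
    }
    return;
}

# Made-up sample shaped like the markup in the question.
my $sample = '<td class="number">1.</td><td class="image"><a href="/title/tt0416449/">';
print first_imdb_id($sample), "\n";    # prints tt0416449
```

Checking the return value for undef also covers the case where the download failed and $content was never set.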

answered Oct 23 '12 at 05:26 by Peter C

@Kartik no problem. Sorry I didn't give an example for HTML::Parser -- I would have liked to, as that's the correct solution -- but I haven't touched Perl for a long while. – Peter C Oct 23 '12 at 15:22
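For reference, a parser-based version might look roughly like the sketch below. It uses HTML::Parser's version 3 handler interface to look at every start tag and keep the ids from links that point at /title/tt.... The helper name imdb_ids_from_html and the sample markup are assumptions for illustration, not the real IMDB page.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect the ids from all /title/tt... links, in document order.
sub imdb_ids_from_html {
    my ($html) = @_;
    my @ids;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag with its lowercased name and a hashref of attributes.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'a' && defined $attr->{href};
                push @ids, $1 if $attr->{href} =~ m{^/title/(tt\d+)};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # signal end of input so any buffered text is processed

    return @ids;
}

# Made-up sample row; the real page layout may differ.
my $sample = <<'HTML';
<table><tr>
<td class="number">1.</td>
<td class="image"><a href="/title/tt0416449/"><img src="poster.jpg"></a></td>
<td class="title"><a href="/title/tt0416449/">300</a></td>
</tr></table>
HTML

my @ids = imdb_ids_from_html($sample);
print "$ids[0]\n";    # prints tt0416449
```

Because the handler only inspects parsed attributes, it does not care about whitespace, attribute order, or extra attributes in the tag, which is where the regex approach tends to break.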

memowe · Answer 2 · 2012-10-24T10:09:49.913

A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.

Long version

The combination of its UserAgent, DOM and URL classes can work in a very robust way:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;

# preparations
my $ua  = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";

# try to load the page
my $tx = $ua->get($url);

# error handling
die join ', ' => $tx->error unless $tx->success;

# extract the url
my $movie_link  = $tx->res->dom('a[href^=/title]')->first;
my $movie_url   = Mojo::URL->new($movie_link->attrs('href'));
say $movie_url->path->parts->[-1];

Output:

tt0381061
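On newer Mojolicious releases the code above may need small adjustments. As far as I know, attrs has been replaced by attr, success has given way to result, and attribute values containing a slash should be quoted in selectors. The sketch below shows only the extraction step on an HTML string, under those assumptions; the helper name first_title_id and the sample markup are made up.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::URL;

# Hypothetical helper: id of the first /title/ link in an HTML string, or undef.
sub first_title_id {
    my ($html) = @_;

    # at() returns the first matching element or undef; the attribute value is quoted.
    my $link = Mojo::DOM->new($html)->at('a[href^="/title"]');
    return undef unless $link;

    # attr('href') reads a single attribute; the last path part is the id.
    return Mojo::URL->new($link->attr('href'))->path->parts->[-1];
}

# Made-up sample; a live page could be fetched with
# Mojo::UserAgent->new->get($url)->result->body instead.
say first_title_id('<p><a href="/title/tt0381061/">Casino Royale</a></p>');
```

Keeping the extraction in a function that takes a string makes it easy to try against a saved copy of the page before hitting the site.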

Short version

The one-liner helper module ojo allows a much shorter version:

$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attrs("href") =~ m|([^/]+)/?$|'

Output:

tt0381061

score 0 · Answer 3 · answered Oct 23 '12 at 06:27

I agree XML is anti-line-editing and thus anti-Unix, but there is awk.

If awk can do it, Perl surely can. I can produce a list:

curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '

$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq

Pass the search string as "ID". It is basically all about how you choose your tokenizer in awk; I use the <a> tag. It should be easier in Perl.
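A rough Perl counterpart of the same tokenizing idea is sketched below: treat each <a ...>text</a> as one token, keep those whose text matches the search string and whose attributes contain a /title/tt... path, and drop consecutive duplicates as uniq would. The helper name title_links and the sample input are assumptions, and like the awk version this is pattern matching on markup, not real parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: return [link text, id] pairs for title links whose
# text matches $pattern, skipping consecutive duplicates like `uniq` does.
sub title_links {
    my ($html, $pattern) = @_;
    my (@rows, $last);

    # Each <a ...>text</a> is one token, similar to the awk record separator.
    while ($html =~ m{<a\b([^>]*)>(.*?)</a>}gs) {
        my ($attrs, $text) = ($1, $2);
        next unless $text =~ /$pattern/;
        next unless $attrs =~ m{/title/(tt[0-9]+)/};
        my $id  = $1;
        my $row = "$text\t$id";
        next if defined $last && $row eq $last;
        push @rows, [$text, $id];
        $last = $row;
    }
    return @rows;
}

# Made-up sample input; the real search page would come from curl or LWP.
my $sample = '<a href="/title/tt0416449/">300</a> '
           . '<a href="/title/tt0416449/">300</a> '
           . '<a href="/name/nm0000000/">300 cast</a>';

print join("\t", @$_), "\n" for title_links($sample, '300');
```

With the sample above this prints a single tab-separated line containing 300 and tt0416449.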
