I need to extract the IMDb id (for example, tt0416449 for the movie 300) of the movie whose search URL is stored in the variable $url. I looked at the source of the results page and came up with the following regex:

use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";

if (is_success( $content = LWP::Simple::get($url) ) ) {
    print "$url is alive!\n";
} else {
    print "No movies found";
}

$code = "";

if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
    $code = $1;
}

I am getting an internal server error at this line:

$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s

I am very new to Perl and would be grateful if anyone could point out my mistake(s).

gpoo
Kartik
  • I was doing some web scraping recently and found that the html sent to my browser was subtly different from that sent to my program (because of different responses generated for different user agent types). Did you examine the html in your browser? – Sheena Oct 23 '12 at 05:43
  • What about using their API? See: http://www.omdbapi.com/ This way you could reduce your parsing effort to a minimum. – Nippey Oct 23 '12 at 05:46
  • @Sheena Yes the html sent to my program looked the same as the source. – Kartik Oct 23 '12 at 07:25
  • @Nippey Sadly I cannot use an API as it is a part of an assignment which instructs me not to use an API. – Kartik Oct 23 '12 at 07:26

3 Answers

Use an HTML parser. Regular expressions cannot parse HTML.
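A minimal sketch of the parser approach, assuming the CPAN module HTML::Parser is installed. The HTML string is a made-up fragment shaped like the markup in the question, not a real IMDb response, and the handler simply keeps the first link whose href starts with /title/:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical fragment; in the real script this would be the fetched page.
my $html = '<table><tr><td class="number">1.</td>'
         . '<td class="image"><a href="/title/tt0416449/">300</a></td></tr></table>';

my $id;
my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag with the tag name and a hashref of attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return if defined $id;    # keep only the first match
            return unless $tagname eq 'a' && defined $attr->{href};
            ($id) = $attr->{href} =~ m{^/title/(tt\d+)};
        },
        'tagname, attr',
    ],
);
$parser->parse($html);
$parser->eof;

print defined $id ? "$id\n" : "No id found\n";
```

The parser sees each tag's attributes already decoded, so the match no longer depends on the exact spacing or ordering of the surrounding markup.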

Anyway, the error is probably caused by the unescaped forward slash in </td>: Perl treats it as the end of the regex, so the rest of the line is a syntax error. With the slash escaped, it looks like this:

/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
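Note that the pattern contains no capture group, so $1 stays unset even when it matches. A hedged sketch of one way to handle both points: use m{...} as the delimiter so slashes need no escaping, and wrap the id in parentheses. The sample string is hypothetical, and the optional trailing slash after the id is an assumption about the markup:

```perl
use strict;
use warnings;

# Hypothetical fragment shaped like the markup the question's regex targets.
my $content = '<td class="number">1.</td><td class="image">'
            . '<a href="/title/tt0416449/" title="300">';

my $code = '';

# m{...} avoids escaping "/"; the parentheses capture the id into $1.
if ($content =~ m{<td class="number">1\.</td><td class="image"><a href="/title/(tt\d{1,7})/?"}s) {
    $code = $1;
}

print "$code\n";
```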
Peter C
  • @Kartik no problem. Sorry I didn't give an example for HTML::Parser -- I would have liked to, as that's the correct solution -- but I haven't touched Perl for a long while. – Peter C Oct 23 '12 at 15:22

A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.

Long version

The combination of its UserAgent, DOM and URL classes can work in a very robust way:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;

# preparations
my $ua  = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";

# try to load the page
my $tx = $ua->get($url);

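# error handling: result() dies on connection errors
# (assumes a recent Mojolicious, where $tx->success and attrs() no longer exist)
my $res = $tx->result;
die 'HTTP error: ' . $res->code . ' ' . $res->message if $res->is_error;

# extract the url of the first link pointing to a title page
my $movie_link = $res->dom->at('a[href^="/title"]')
    or die "No title link found\n";
my $movie_url  = Mojo::URL->new($movie_link->attr('href'));
say $movie_url->path->parts->[-1];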

Output:

tt0381061

Short version

The ojo helper module for one-liners makes a very short version possible:

$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attr("href") =~ m|([^/]+)/?$|'

Output:

tt0381061
memowe

I agree that XML is hostile to line-oriented editing, and therefore un-Unix-like, but there is AWK.

If awk can do it, Perl surely can too. This produces a list of matching titles and ids:

curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '

$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq

Pass the search string in as "ID". It all comes down to how you choose your tokenizer in awk; here the <a> tag is the record separator. It should be easier in Perl.
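A rough Perl sketch of the same tokenizing idea, for comparison. It splits the input on <a and </a> the way RS does and splits each record on > and " the way FS does. The sample markup, the $search variable and the @rows array are made up for illustration; a real script would read the fetched page instead:

```perl
use strict;
use warnings;

# Hypothetical search-result markup, not a real IMDb response.
my $html = join '',
    '<a href="/title/tt0416449/?ref_=fn_al_tt_1">300</a> (2006) ',
    '<a href="/name/nm0000123/">Someone</a> ',
    '<a href="/title/tt1253863/">300: Rise of an Empire</a>';

my $search = '300';    # plays the role of ID in the awk version
my @rows;

# Records are separated by "<a" or "</a>" (awk's RS).
for my $record (split m{<a|</a>}, $html) {
    # Fields are separated by ">" or a double quote (awk's FS).
    my @fields = split /[>"]/, $record;
    next unless @fields;

    my $text = $fields[-1];    # link text, like $NF
    next unless $text =~ /\Q$search\E/ && $record =~ /title/;
    next unless defined $fields[1] && $fields[1] =~ m{/(tt[0-9]+)/};

    push @rows, "$text\t$1";
}

print "$_\n" for @rows;
```

Like the awk version, this treats any tag beginning with <a as a link, so it is only as reliable as the markup is regular.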

MeaCulpa