Python: Extracting numbers from result

Question

I am working on a Python script to automatically extract ratings from IMDb, but I am unable to extract the numbers from my result.

from pattern.web import URL
from pattern.web import plaintext
from pattern.web import decode_utf8
import re

def scrape_imdb(film):
    url = URL (film)
    s=url.download()
    decode_utf8(url.download(s))
    regels=re.compile(('"ratingValue">[0-9].[0-9]'))
    rating= regels.findall(s)
    rating2= rating[0:1]
    rating3= rating2.findall("[0-9"])

    regels2=re.compile ("<title>.*</title>")
    titel=regels2.findall(s)
    print titel, rating2

But this gives me an error. Does anyone know what I'm doing wrong?

Someone will only be able to resolve this if you post your exact error message too. — Rohit Jain, Feb 18 '13 at 21:51
Please, for the love of god, don't scrape popular websites; it's against the terms of service and normally gets your IP banned! Please see http://stackoverflow.com/a/7744369/462604 — Jakob Bowyer, Feb 18 '13 at 21:52
`rating2.findall("[0-9"])` <- The ending quote character is in the wrong spot. — eldarerathis, Feb 18 '13 at 21:53
Possible duplicate of - http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api — Matt Busche, Feb 18 '13 at 21:59

score 3 · Accepted Answer · answered Feb 18 '13 at 22:05

As you wrote in a comment to another answer:

I still get: AttributeError: 'list' object has no attribute 'findall'

So this seems to be your problem. findall returns a list of matches, so rating is a list. When you then do rating2 = rating[0:1], you assign a slice of that list to rating2, so rating2 is itself a list (with at most one element). A list does not have a findall method, so the call fails.
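To make the difference between slicing and indexing concrete, here is a minimal sketch. The string s is a made-up stand-in for the downloaded page source (an assumption, not real IMDb markup), and the dot in the pattern is escaped so that it only matches a literal decimal point.

```python
import re

# Hypothetical snippet standing in for the downloaded page source
s = '<span itemprop="ratingValue">7.9</span>'

rating = re.findall(r'"ratingValue">[0-9]\.[0-9]', s)
sliced = rating[0:1]   # slicing a list yields another list
first = rating[0]      # indexing yields the element itself, a string

# A list has no findall attribute, which is what triggers the AttributeError
print(type(sliced).__name__, hasattr(sliced, "findall"))
print(type(first).__name__)
```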

What you probably want to do is run another regular expression on the first result in rating:

rating = regels.findall(s)
rating2 = rating[0] # only get the first element; a string
rating3 = re.findall("[0-9]", rating2)
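As a possible alternative to running two regular expressions, a capturing group can pull the number out in one step. This is only a sketch under assumptions: sample_html is an invented fragment, extract_rating and extract_title are hypothetical helper names, and the real page markup may differ or change over time.

```python
import re

# Hypothetical HTML fragment; real page markup may look different
sample_html = (
    '<title>Some Film (2013) - IMDb</title>'
    '<span itemprop="ratingValue">7.9</span>'
)

def extract_rating(html):
    """Return the rating as a float, or None if no rating is found."""
    # The parentheses capture only the digits; \. matches a literal dot
    match = re.search(r'"ratingValue">([0-9]\.[0-9])', html)
    if match is None:
        return None
    return float(match.group(1))

def extract_title(html):
    """Return the text between <title> tags, or None if absent."""
    # .*? is non-greedy, so it stops at the first closing tag
    match = re.search(r"<title>(.*?)</title>", html)
    return match.group(1) if match else None

print(extract_title(sample_html), extract_rating(sample_html))
```

Returning None when nothing matches avoids the IndexError that rating[0] would raise on an empty result list.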

score 0 · Answer 2 · edited Feb 18 '13 at 21:59

I believe you have a typo here:

rating3= rating2.findall("[0-9"])

It should be:

rating3= rating2.findall("[0-9]")

answered Feb 18 '13 at 21:55 by Rahul Banerjee; edited by Matt Busche

Even when I correct the error, I still get: AttributeError: 'list' object has no attribute 'findall' – Shifu Feb 18 '13 at 22:00
