from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        # check that the url responds before trying to parse it
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
            continue
        site = urlopen(url)
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

My code opens only a single page for each URL in the file, but sometimes there are more pages; in that case the pattern for the next pages would be &page=x.

Here are the pages I'm talking about:

http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track
http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=7

muchacho
    It's not clear what the question is asking for. What exactly is it that you're trying to do? Could you provide a concrete example? – Haldean Brown Nov 12 '12 at 17:42
  • Well, I edited the post... basically I want it to try to get all the next pages those addresses happen to have. – muchacho Nov 12 '12 at 18:20

3 Answers


You could read the href attribute from the next_page link and add it to your urls list (note that your generator expression needs to become a list so you can append to it). It could be something like this:

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2
import urlparse

with open('urls.txt') as inf:
    urls = [line.strip() for line in inf]  # a list, so we can append to it below
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
            continue
        site = urlopen(url)
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

        # if the page has a "next" link, resolve it against the current url
        # and append it to the list, so the loop above will visit it later
        next_page = soup.find_all('a', {'class': 'nextlink'})
        if next_page:
            next_page = next_page[0]
            urls.append(urlparse.urljoin(url, next_page['href']))
payala
  • I'm getting this IOError: [Errno 2] The system cannot find the path specified: '\\user\\Skotopes\\library\\tags?tag=rock&page=2' – muchacho Nov 13 '12 at 08:37
  • The href is relative, you need to add the base url in order to have an absolute url. See my edit above. – payala Nov 13 '12 at 08:48
  • As you probably have already imagined, the next_page list was empty in the last page. So, by checking that the list is not empty, you know you can access your next page link. – payala Nov 13 '12 at 13:36
  • Yes, I forgot to mention that when there is more than one URL in the file it got the same error, but on the second or third page. – muchacho Nov 13 '12 at 15:14
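
A quick standalone illustration of the urljoin call used in the answer above; the href value below is just an example modelled on the error message in the comments, not taken from a real page.

import urlparse

base = 'http://www.last.fm/user/Skotopes/library/tags?tag=rock'
href = '/user/Skotopes/library/tags?tag=rock&page=2'   # a root-relative "next" link

# urljoin resolves the relative href against the page it was found on,
# giving an absolute url that urlopen() can actually fetch
print urlparse.urljoin(base, href)
# http://www.last.fm/user/Skotopes/library/tags?tag=rock&page=2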

You could create something that gets all the links from a page and follows them, which is something Scrapy does for free.

You can create a spider which will follow all links on the page. Assuming that there are pagination links to the other pages, your scraper will automatically follow them.

You can accomplish the same thing by parsing all the links on the page with BeautifulSoup, but why do that if Scrapy already does it for free?
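
For illustration, here is a minimal sketch of such a spider written against the current Scrapy API. The spider name and the page= rule are assumptions; the start URL is the first page from the question, and the td.subjectCell selector mirrors the BeautifulSoup code above.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TagPagesSpider(CrawlSpider):
    # hypothetical spider name; the start url is the first page from the question
    name = 'lastfm_tags'
    allowed_domains = ['www.last.fm']
    start_urls = [
        'http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track',
    ]

    # follow every link whose url contains "page=" (the pagination links)
    # and hand each page it leads to over to parse_page()
    rules = (
        Rule(LinkExtractor(allow=r'page='), callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not run rule callbacks on the start url itself,
        # so extract the first page explicitly
        return self.parse_page(response)

    def parse_page(self, response):
        # same extraction as the BeautifulSoup version: the text of the <a>
        # inside each <td class="subjectCell">
        for text in response.css('td.subjectCell a::text').extract():
            yield {'tag_item': text}

Saved as, say, lastfm_spider.py (a hypothetical filename), it can be run without a full project via scrapy runspider lastfm_spider.py -o tags.json.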

dm03514

I'm not sure I understand your question, but you might think about creating a regex (http://www.tutorialspoint.com/python/python_reg_expressions.htm) that matches your 'next' pattern and searching for it amongst the URLs found on a page. I use this approach a lot when there is a high degree of conformance in the intra-site links.
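
A rough sketch of that idea; the page=\d+ pattern and the URL are assumptions based on the links in the question, and real pages may need more careful handling than this.

import re
import urllib2
import urlparse

url = 'http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track'
html = urllib2.urlopen(url).read()

# pull out every href that contains a "page=<number>" query parameter
# and resolve it against the page it was found on
next_links = set()
for href in re.findall(r'href="([^"]*page=\d+[^"]*)"', html):
    href = href.replace('&amp;', '&')  # undo html-escaping of "&" in attribute values
    next_links.add(urlparse.urljoin(url, href))

for link in sorted(next_links):
    print link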

Jay Gattuso
  • Which I have read a number of times. It turns out that when you know the source, it's not an unsafe way of processing. I have used this method more than 30 times on different sites and have never encountered anything but success. Is it 'the right way'? Clearly not. Does it work? Yes. In the above usage, you are not 'consuming' HTML but parsing regular text (text that is not being evaluated for its HTML components, but for its language/semantic components) against a regular expression, which is exactly what regex is designed to do. Arguments against using this approach (given the above) may follow... – Jay Gattuso Nov 13 '12 at 06:07