
I have a seed url (say DOMAIN/manufacturers.php) with no pagination that looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="st-text">
        <table cellspacing="6" width="600">
            <tr>
                <td>
                    <a href="manufacturer1-type-59.php"></a>
                </td>

                <td>
                    <a href="manufacturer1-type-59.php">Name 1</a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php"></a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php">Name 2</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer3-type-88.php"></a>
                </td>

                <td>
                    <a href="manufacturer3-type-88.php">Name 3</a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php"></a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php">Name 4</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer5-type-28.php"></a>
                </td>

                <td>
                    <a href="manufacturer5-type-28.php">Name 5</a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php"></a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php">Name 6</a>
                </td>
            </tr>
        </table>
    </div>
</body>
</html>

From there I would like to get all the a['href'] values, for example manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?

Optionally, I would like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.
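
For illustration only, here is a minimal sketch of that first step; it assumes the third-party requests and beautifulsoup4 packages (neither is required by the question) and uses urljoin to add the missing domain prefix, with DOMAIN kept as a placeholder:

# A rough sketch only: assumes requests and beautifulsoup4 are installed,
# and DOMAIN is a placeholder for the real site root.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://DOMAIN/'

def get_manufacturer_links(seed='manufacturers.php'):
    soup = BeautifulSoup(requests.get(urljoin(BASE_URL, seed)).text, 'html.parser')
    table = soup.find('div', class_='st-text')
    # The hrefs are relative, so urljoin adds the domain prefix.
    links = sorted({urljoin(BASE_URL, a['href'])
                    for a in table.find_all('a', href=True)})
    # Keep them in memory (the returned list) and on disk for later.
    with open('manufacturer_links.txt', 'w') as fh:
        fh.write('\n'.join(links))
    return links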

The content of each of these links, such as manufacturer1-type-59.php, looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="makers">
        <ul>
            <li>
                <a href="manufacturer1_model1_type1.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model1_type2.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model2_type3.php"></a>
            </li>
        </ul>
    </div>

    <div class="nav-band">
        <div class="nav-items">
            <div class="nav-pages">
                <span>Pages:</span><strong>1</strong>
                <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
                <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
                <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
            </div>
        </div>
    </div>
</body>
</html>

Next, I would like to get all the a['href'] values, for example manufacturer1_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages support pagination, so I would like to go into all of those pages too. As the example shows, the pagination on manufacturer1-type-59.php leads to pages such as manufacturer1-type-STRING-59-INT-p2.php.

Optionally, I would also like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.
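
Again just a sketch, under the same assumptions as the previous snippet (requests, BeautifulSoup, and the placeholder BASE_URL): collect the phone links from a manufacturer page and keep following its "Next page" link until the pagination runs out:

# Sketch of the second step: phone links from one manufacturer page,
# following the pagination via the "Next page" link.
def get_phone_links(manufacturer_url):
    links = set()
    url = manufacturer_url
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        makers = soup.find('div', class_='makers')
        if makers:
            links.update(urljoin(BASE_URL, a['href'])
                         for a in makers.find_all('a', href=True))
        # Follow the "Next page" link, if any; None ends the loop.
        next_page = soup.find('a', title='Next page')
        url = urljoin(BASE_URL, next_page['href']) if next_page else None
    # Keep the links in memory (the returned set) and append them to disk.
    with open('phone_links.txt', 'a') as fh:
        fh.write('\n'.join(sorted(links)) + '\n')
    return links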

The third and final step should be to retrieve the content of all pages of the manufacturer1_model1_type1.php type, extract the title, and save the result to a file in the form (url, title).
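
A sketch of that last step under the same assumptions, writing one (url, title) row per page to a CSV file:

# Sketch of the third step: fetch each phone page, grab its <title>,
# and write (url, title) rows to a CSV file.
import csv

def save_titles(phone_links, out_path='titles.csv'):
    with open(out_path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        for url in sorted(phone_links):
            soup = BeautifulSoup(requests.get(url).text, 'html.parser')
            title = soup.title.string.strip() if soup.title and soup.title.string else ''
            writer.writerow((url, title))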

EDIT

This is what I have done so far, but it doesn't seem to work...

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArchiveItem(scrapy.Item):
    url = scrapy.Field()

class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        Rule(LinkExtractor(allow=['\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=['\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=['\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent
  • Is it a public site whose URL you can share? (it would help to help) – alecxe Apr 08 '15 at 18:20
  • Sure, the above example is based on the following seed URL (http://www.gsmarena.com/makers.php3). However, the most important thing for me is to understand the underlying idea. Still, if you could send me a working example it would be much easier for me to understand all these concepts. :) – user706838 Apr 08 '15 at 18:30
  • Hi @alecxe, I just added what I have tried so far (although it doesn't work - yet!). Could you please have a look? Thanks! – user706838 Apr 08 '15 at 20:56
  • @alecxe Hi! My solution above looks OK, but I would like to use proxies as well. Any ideas? – user706838 Apr 09 '15 at 16:00
  • Sorry for coming back late. Have you solved your current issue? – alecxe Apr 10 '15 at 19:39
  • @alecxe Hi, thanks a lot for your response. My solution seems OK, but I get banned from the website after crawling about 300 URLs. So, it would be nice to know how `scrapy` uses `proxies`, with a concrete example. I would really appreciate it! :) – user706838 Apr 11 '15 at 05:45

1 Answer


I think you'd better use a plain Spider (the old BaseSpider) instead of CrawlSpider.

This code might help:

from scrapy import Spider, Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # maker links on the seed page
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        if markers:
            for marker in markers:
                yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        url = response.url
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if phones:
            for phone in phones:
                # change the callback name here if the first crawl should handle these pages differently
                yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)
        else:
            return

        # pagination
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever stuff you want and yield items here
        pass

EDIT

If you want to keep track of where these phone URLs are coming from, you can pass the URL as meta from parse to parse_phone through parse_marker; the requests would then look like this:

yield Request(url=self.BASE_URL + marker, callback=self.parse_marker,
              meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone,
              meta={'url_level2': response.url,
                    'url_level1': response.meta['url_level1']})
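
For completeness, here is a sketch (assuming a Scrapy version that lets spiders yield plain dicts; otherwise the ArchiveItem from the question could be extended with the extra fields) of how parse_phone could read those meta values back and emit the (url, title) records the question asks for; the key names are only illustrative:

    def parse_phone(self, response):
        # read back the urls passed along via meta, plus this page's own data
        title = response.xpath('//title/text()').extract()
        yield {
            'url_level1': response.meta.get('url_level1'),   # maker listing page
            'url_level2': response.meta.get('url_level2'),   # paginated phone list
            'url': response.url,
            'title': title[0].strip() if title else '',
        }

Running the spider with scrapy crawl gsmarena -o phones.csv would then export those records through the built-in feed exporter.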
  • Hi! Any particular reason why you are in favor of `Spider` instead of `CrawlSpider`? Btw, my solution above looks OK, but I would like to use proxies as well. Any ideas? – user706838 Apr 09 '15 at 16:01
  • For the difference between the crawl and base spiders, refer [here](http://doc.scrapy.org/en/latest/topics/spiders.html); for proxies you can use a proxy middleware [proxy-middleware](http://stackoverflow.com/questions/20792152/setting-scrapy-proxy-middleware-to-rotate-on-each-request) – Jithin Apr 09 '15 at 16:04
  • Thanks for the link regarding `proxies`; I find it difficult, though, to understand how it works - yet! Ideally, what I would like to do is give a list of ip:port pairs, for example [ip1:port1, ip2:port2, ip3:port3], and let `CrawlSpider` choose one at a time (randomly). Note that I still want to use `CrawlSpider`. So, how can I do that? – user706838 Apr 09 '15 at 18:15
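
Regarding the rotating-proxy question in the comments: one common pattern (shown here only as a sketch; the module path and the proxy addresses are placeholders) is a small downloader middleware that sets request.meta['proxy'] to a randomly chosen entry on every request, which works the same way for CrawlSpider and plain Spider:

# middlewares.py (hypothetical path) - choose a random proxy per request
import random

PROXIES = [
    'http://ip1:port1',  # placeholders: replace with real proxies
    'http://ip2:port2',
    'http://ip3:port3',
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's HTTP downloader honours request.meta['proxy']
        request.meta['proxy'] = random.choice(PROXIES)

It is enabled from settings.py; the exact priority number mostly matters relative to other custom middlewares:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 760,  # 'myproject' is a placeholder
}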