I created a spider in Scrapy. items.py:

from scrapy.item import Item, Field

class dns_shopItem(Item):
    # Define the fields for your item here like:
    # name = Field()
    id = Field()
    idd = Field()

dns_shop_spider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from dns_shop.items import dns_shopItem
 
class dns_shopLoader(XPathItemLoader):
    default_output_processor = TakeFirst()

class dns_shopSpider(CrawlSpider):
    name = "dns_shop_spider"
    allowed_domains = ["www.playground.ru"]
    start_urls = ["http://www.playground.ru/files/stalker_clear_sky/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True),
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = dns_shopLoader(dns_shopItem(), hxs)
        l.add_xpath('id', "/html/body/table[2]/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()")
        l.add_xpath('idd', "//html/body/table[2]/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()")
        return l.load_item()

Run the following command:

scrapy crawl dns_shop_spider -o scraped_data_utf8.csv -t csv

The log shows that Scrapy goes through all the required URLs, but nothing is written to the specified file when the spider runs. What could be the problem?

Talvalin
user2420607

1 Answer

Assuming you want to follow all the links on the page http://www.playground.ru/files/stalker_clear_sky/ and get the titles, URLs and download links:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector

from scrapy.item import Item, Field


class PlayGroundItem(Item):
    title = Field()
    url = Field()
    download_url = Field()


class PlayGroundLoader(XPathItemLoader):
    default_output_processor = TakeFirst()


class PlayGroundSpider(CrawlSpider):
    name = "playground_spider"
    allowed_domains = ["www.playground.ru"]
    start_urls = ["http://www.playground.ru/files/stalker_clear_sky/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True, callback='parse_item'),
    )


    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = PlayGroundLoader(PlayGroundItem(), hxs)
        l.add_value('url', response.url)
        l.add_xpath('title', "//div[@class='downloads-container clearfix']/h1/text()")
        l.add_xpath('download_url', "//div[@class='files-download-holder']/div/a/@href")

        return l.load_item()

Save it to test_scrapy.py and run it via:

scrapy runspider test_scrapy.py -o output.json

Then check output.json.
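If the spider scraped anything, output.json will contain a JSON array of objects carrying the three item fields. A quick standalone sanity check might look like this sketch (the field names come from PlayGroundItem above; the sample data written here is made up for illustration):

```python
import json

def check_items(path):
    """Load a Scrapy JSON export and report items missing required fields."""
    with open(path) as f:
        items = json.load(f)  # `-o output.json` writes a single JSON array
    required = {"title", "url", "download_url"}
    missing = [item for item in items if not required <= set(item)]
    return len(items), missing

# Hypothetical sample of what the export could look like:
sample = [{"title": "Some mod",
           "url": "http://www.playground.ru/files/...",
           "download_url": "http://www.playground.ru/download/..."}]
with open("output.json", "w") as f:
    json.dump(sample, f)

total, missing = check_items("output.json")
print(total, missing)  # 1 []
```

An empty array here would point at the rules/callback wiring rather than the XPaths, since Scrapy only writes items that were actually returned from a callback.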

Hope that helps.

alecxe
  • I can't figure out where I need to click to upvote you. – user2420607 May 26 '13 at 18:09
  • I've accepted the answer. Still, I'd like to ask why my XPath query didn't work while yours does. They are: l.add_xpath('title', "//div[@class='downloads-container clearfix']/h1/text()") and: l.add_xpath('title', ".//*[@id='mainTable']/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()"), and only the first one works. I wrote my XPath query using Firebug for Mozilla Firefox. How do you write your XPath queries? – user2420607 May 27 '13 at 12:44
  • Yes, I use browser developer tools as well. But the XPaths they generate can usually be dramatically simplified. – alecxe May 27 '13 at 13:33
  • And what kind of developer tools do you use to write XPath queries? – user2420607 May 27 '13 at 16:50
  • I use the Chrome developer tools to inspect elements on the page. Usually an element can be found easily by the `id` or `class` of the element itself or its parents. – alecxe May 27 '13 at 18:22
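The simplification alecxe describes can be checked offline with lxml: a long browser-generated absolute path and a short class-based XPath can select the same node. A sketch with a made-up HTML fragment (the real page structure may differ; assumes lxml is installed):

```python
from lxml import html

# Minimal made-up markup mimicking the structure discussed above
doc = html.fromstring("""
<html><body>
  <div class="downloads-container clearfix">
    <h1>S.T.A.L.K.E.R.: Clear Sky</h1>
  </div>
</body></html>
""")

# Browser tools tend to emit long absolute paths:
long_xpath = "/html/body/div/h1/text()"
# A class-based predicate is shorter and survives layout changes better:
short_xpath = "//div[@class='downloads-container clearfix']/h1/text()"

print(doc.xpath(long_xpath))   # ['S.T.A.L.K.E.R.: Clear Sky']
print(doc.xpath(short_xpath))  # same result
```

One more pitfall worth knowing: browser developer tools show tbody elements that the browser inserts into tables, while the raw HTML that Scrapy downloads often has none, so absolute paths containing tbody can fail even when they look correct in Firebug.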