
I recently started trying out Scrapy for a project and got fairly confused by the various older syntaxes floating around (SgmlLinkExtractor etc.), but I managed to put together code that seemed legible and made sense to me. However, it does not traverse every page on the website: it only visits the start_urls page and never produces the output file. Can someone explain what I'm missing?

import scrapy
import csv
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    allowed_domains='ralphlauren.com/product/'
    start_urls=[
        'http://www.ralphlauren.com/'
    ]
    rules = (
        Rule(LinkExtractor(),callback="parse_item",follow=True),
    )

    def parse_item(self, response):
        name = response.xpath('//h1/text()').extract_first()
        price = response.xpath('//span[@class="reg-price"]/span/text()').extract_first()
        image=response.xpath('//input[@name="enh_0"]/@value').extract_first()
        print("Rules=",rules)
        tup=(name,price,image)
        csvF=open('data.csv','w')
        csvWrite = csv.writer(csvF)
        csvWrite.writerow(tup)
        return []
    def parse(self,response):
        pass

I'm trying to extract data from every page under /product/ and write it to a CSV file.

Here are the logs:

2016-12-07 19:46:49 [scrapy] INFO: Scrapy 1.2.2 started (bot: P35Crawler)
2016-12-07 19:46:49 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'P35Crawler', 'NEWSPIDER_MODULE': 'P35Crawler.spiders', 'SPIDER_MODULES': ['P35Crawler.spiders']}
2016-12-07 19:46:49 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-12-07 19:46:50 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-07 19:46:50 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-07 19:46:50 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-07 19:46:50 [scrapy] INFO: Spider opened
2016-12-07 19:46:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-07 19:46:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-07 19:46:51 [scrapy] DEBUG: Redirecting (302) to <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> from <GET http://www.ralphlauren.com/>
2016-12-07 19:46:51 [scrapy] DEBUG: Crawled (200) <GET http://www.ralphlauren.com/home/index.jsp?ab=Geo_iIN_rUS_dUS> (referer: None)
2016-12-07 19:46:51 [scrapy] INFO: Closing spider (finished)
2016-12-07 19:46:51 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 497,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 20766,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 7, 14, 16, 51, 973406),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 12, 7, 14, 16, 50, 287464)}
2016-12-07 19:46:51 [scrapy] INFO: Spider closed (finished)

1 Answer


You shouldn't override the parse() method with an empty one. CrawlSpider uses parse() internally to apply its rules, so replacing it with a no-op means no links are ever followed and your callback never runs. Just remove that method declaration. Please let me know if this helps.
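
For reference, here is a minimal corrected sketch (assuming the product pages live under /product/ on ralphlauren.com; the XPaths are taken from your code and otherwise untested):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RLSpider(CrawlSpider):
    name = "RL"
    # allowed_domains takes bare domain names, not URL paths
    allowed_domains = ['ralphlauren.com']
    start_urls = ['http://www.ralphlauren.com/']

    rules = (
        # Hand /product/ pages to parse_item; keep following all other
        # links so the crawler can actually reach the product pages.
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(), follow=True),
    )

    # No parse() override here: CrawlSpider needs its own parse().
    def parse_item(self, response):
        yield {
            'name': response.xpath('//h1/text()').extract_first(),
            'price': response.xpath('//span[@class="reg-price"]/span/text()').extract_first(),
            'image': response.xpath('//input[@name="enh_0"]/@value').extract_first(),
        }

Run it with scrapy crawl RL -o data.csv and Scrapy's feed export writes the CSV for you; opening data.csv in write mode inside parse_item overwrites the file on every item.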

UPDATE

Regarding your comment about links generated by JavaScript: there are different ways to handle that, but they all need a browser engine to execute the JS. Let's say you want to drive Firefox with Selenium.

The best way, in my opinion, is to implement a download handler, as I explain in this answer. Alternatively, you could implement a downloader middleware, as explained here. The middleware has some downsides compared to the handler: with a download handler you can still use the default cache, retries, and so on.
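
As a rough illustration of the middleware route (just a sketch; the module and class names are placeholders, and it skips error handling and driver shutdown):

# middlewares.py (hypothetical module)
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        # One Firefox instance, driven by Selenium, shared by all requests.
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # Let the real browser fetch the page and execute its JavaScript,
        # then return the rendered HTML to Scrapy as a normal response.
        self.driver.get(request.url)
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

You would then enable it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'P35Crawler.middlewares.SeleniumMiddleware': 543}.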

After you get the basic script working with Firefox, you can switch to PhantomJS by changing just a few lines. PhantomJS is a headless browser, so it doesn't have to render a browser UI and is therefore much faster.
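
Assuming the PhantomJS binary is on your PATH, the swap is essentially one line:

# replace the Firefox driver...
# self.driver = webdriver.Firefox()
# ...with the headless PhantomJS driver:
self.driver = webdriver.PhantomJS()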

Other solutions include using Docker with Splash, but I ended up considering this overkill, as you need to run a VM just to control the browser.

So, summing up, the best solution is to implement a download handler that uses Selenium with PhantomJS.

Ivan Chaer
  • Hi, thanks, that worked, but now with allowed_domains restricted to ralphlauren.com, everything is getting filtered out and it still only crawls the first page. – Vishnu Bhagyanath Dec 07 '16 at 16:24
  • Okay, I noticed most of the links I need are in the JavaScript part of the website; how do I crawl through those? – Vishnu Bhagyanath Dec 07 '16 at 16:56