
Spider for reference:

import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem



class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            # print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            # print url
            yield item

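A minimal items.py matching the spider above would be something like this (a sketch only; the actual file isn't shown in the question, and the project name "script" and the single url field are taken from the spider's imports and usage):

# items.py -- minimal sketch of the item the spider above assumes.
import scrapy


class ScriptItem(scrapy.Item):
    url = scrapy.Field()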
When I run the spider, the output in the terminal is as follows:

2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>

This is the HTML:

<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls">
    <a class="stp-control stp-left stp-hidden">&lt;</a>
    <div class="stp-inner" style="width: auto">
        <div class="stp-slide" style="left: -0%">
                        <a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878713">
                            </a>
                        <a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878708">

To me it seems to hit a block when trying to activate the JavaScript. I am aware that JavaScript cannot run in Scrapy, but there must be a way of getting to those links. I have looked at Selenium but cannot get a handle on it.

Any and all help welcome.


2 Answers

7

I've solved it with ScrapyJS.

Follow the setup instructions in the official documentation and this answer.
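For reference, the setup boils down to pointing Scrapy at a running Splash instance in settings.py. A sketch of that configuration, assuming the older ScrapyJS package and a Splash container on localhost:8050 (the middleware names and the URL may differ with your version and Docker setup, so treat this as illustrative):

# settings.py -- illustrative ScrapyJS/Splash setup; adjust SPLASH_URL to where
# your Splash container is actually reachable (localhost, boot2docker IP, etc.).
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

# Deduplicate requests that differ only in their Splash arguments.
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'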

Here is the test spider I've used:

# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            print(widget.xpath('.//a/@href').extract())

And here is what I've got on the console:

[u'http://rstyle.me/iA-n/7bk8r4c_', u'http://rstyle.me/iA-n/7bk754c_', u'http://rstyle.me/iA-n/6th5d4c_', u'http://rstyle.me/iA-n/7bm3s4c_', u'http://rstyle.me/iA-n/2xeat4c_', u'http://rstyle.me/iA-n/7bi7f4c_', u'http://rstyle.me/iA-n/66abw4c_', u'http://rstyle.me/iA-n/7bm4j4c_']
[u'http://rstyle.me/iA-n/zzhv34c_', u'http://rstyle.me/iA-n/zzhvw4c_', u'http://rstyle.me/iA-n/zwuvk4c_', u'http://rstyle.me/iA-n/zzhvr4c_', u'http://rstyle.me/iA-n/zzh9g4c_', u'http://rstyle.me/iA-n/zzhz54c_', u'http://rstyle.me/iA-n/zwuuy4c_', u'http://rstyle.me/iA-n/zzhx94c_']
alecxe
  • This is fantastic, however it just hangs in console for ages doing this: 2015-08-21 16:36:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2015-08-21 16:37:00 [scrapy] DEBUG: Gave up retrying (failed 3 times): TCP connection timed out: 60: Operation timed out. 2015-08-21 16:37:00 [scrapy] ERROR: Error downloading : TCP connection timed out: 60: Operation timed out. – Wine.Merchant Aug 21 '15 at 15:39
  • @Wine.Merchant have you started the splash docker container? Thanks. – alecxe Aug 21 '15 at 15:40
  • I opened docker and put in the right line $ docker run -p 8050:8050 scrapinghub/splash and then this happened: 2015-08-21 15:22:19.651375 [-] Starting factory – Wine.Merchant Aug 21 '15 at 15:47
  • I stopped and restarted the scrapinghub/splash container, but it still times out. Any more observations as to why this may be happening? – Wine.Merchant Aug 21 '15 at 16:10
  • There was a self.parse inside the request in the other example, but having tried it nothing changed; it is still timing out. I don't suppose it could have anything to do with this: "Telnet console listening on 127.0.0.1:6023"? – Wine.Merchant Aug 21 '15 at 16:51
  • No, the Telnet console doesn't interfere with Splash. Test that your Splash is working by going to 192.168.59.103:8050 in a browser. If it isn't, Splash is not working or not reachable, and the issue isn't in the spider. – Rejected Aug 21 '15 at 17:07
  • The issue is not with the spider; I cannot access that address through the browser. It just times out. – Wine.Merchant Aug 24 '15 at 09:24
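As a quick way to diagnose the timeouts discussed in the comments above, here is a minimal sketch that hits Splash's render.html endpoint directly with the requests library (the host and port are assumptions; use whatever address your Docker setup actually exposes):

# Standalone Splash connectivity check -- not part of the spider.
# Assumes Splash was started with: docker run -p 8050:8050 scrapinghub/splash
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://www.stopitrightnow.com/', 'wait': 0.5},
    timeout=30,
)
print(resp.status_code)                    # 200 means Splash is up and rendering
print('shopthepost-widget' in resp.text)   # True if the widget markup came back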
6

A non-JavaScript alternative to alecxe's answer is to manually inspect where the page is loading the content from and add that functionality in yourself (see this SO question for more details).

In this case, inspecting the network traffic in the browser's developer tools (screenshot omitted) shows where the widget content is loaded from.

So, for <div class="shopthepost-widget" data-widget-id="708473">, JavaScript is executed to embed the URL "widgets.rewardstyle.com/stps/708473.html".

You could handle this by manually generating a request for these URLs yourself:

from scrapy import Request  # needed for the manually generated widget requests

def parse(self, response):
    for widget in response.xpath('//div[@class="shopthepost-widget"]'):
        widget_id = widget.xpath('@data-widget-id').extract()[0]
        widget_url = "http://widgets.rewardstyle.com/stps/{id}.html".format(id=widget_id)
        yield Request(widget_url, callback=self.parse_widget)

def parse_widget(self, response):
    for link in response.xpath('//a[contains(@class, "stp-product")]'):
        item = JavasItem()  # item name provided by the question's author, see comments below
        item['link'] = link.xpath("@href").extract()
        yield item

    # Do whatever else you want with the opened page.

If you need to keep these widgets associated with whatever post/article they are a part of, pass that information into the request via meta.
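A sketch of what that could look like, building on the snippet above (the source_url meta key and the item's source field are illustrative names, not part of the original code):

def parse(self, response):
    for widget in response.xpath('//div[@class="shopthepost-widget"]'):
        widget_id = widget.xpath('@data-widget-id').extract()[0]
        widget_url = "http://widgets.rewardstyle.com/stps/{id}.html".format(id=widget_id)
        # Carry the originating page along with the widget request.
        yield Request(widget_url, callback=self.parse_widget,
                      meta={'source_url': response.url})

def parse_widget(self, response):
    source_url = response.meta['source_url']  # the page this widget belongs to
    for link in response.xpath('//a[contains(@class, "stp-product")]'):
        item = JavasItem()
        item['link'] = link.xpath("@href").extract()
        item['source'] = source_url  # assumes a 'source' field on JavasItem
        yield item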

EDIT: parse_widget() has been updated. It uses contains() to match the class, since the class attribute has a trailing space. You could alternatively use a CSS selector, but it's really your call.
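For illustration, the CSS-selector version of that extraction could look like this (class matching in CSS ignores the trailing space, so it picks up the same links):

# CSS alternative to //a[contains(@class, "stp-product")]
for link in response.css('a.stp-product'):
    item = JavasItem()
    item['link'] = link.css('::attr(href)').extract()
    yield item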

Rejected
  • This seems like it would be a nicer solution without having to rely on a third party. The terminal says it goes to the source page of the widget, but I still cannot get items from the page to be stored. Could you please give an example of how you would get the links to be stored using parse_widget? – Wine.Merchant Aug 24 '15 at 15:11
  • I am using this: for sel in response.xpath('//a[@class="stp-control stp-left stp-hidden"]'): item = JavasItem() item['itemUrl'] = response.xpath('.//a/@href').extract() yield item – Wine.Merchant Aug 24 '15 at 15:12
  • You need to look at the HTML directly at the target link, NOT on the page where it's loaded inline. When it's loaded via JS, that JS can edit/append/remove classes based on the page it's loading into. Since Scrapy isn't processing JS or seeing this, you'll get differing results. I've updated the `parse_widget` function to extract all links on the page. – Rejected Aug 24 '15 at 15:38
  • Speedy reply, very much appreciated as is all the help you have offered. – Wine.Merchant Aug 24 '15 at 16:51