
I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I'm still not able to render the JavaScript triggered by a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf

I am still getting the page without the phone number rendered:

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main(splash)
            splash:go(splash.args.url)
            splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
            splash:wait(0.5)
            return splash:html()
        end
        """
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        # drop into the debugger to inspect the (still un-rendered) response
        import ipdb; ipdb.set_trace()

How can I get this to work?

psychok7

2 Answers


Add

splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")

to the Lua script (before the splash:go call, so jQuery is loaded together with the page) and it will work.

function main(splash)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:go(splash.args.url)
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end

.click() is a jQuery function: https://api.jquery.com/click/
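
For reference, here is a minimal sketch (nothing assumed beyond the question's own spider and selectors) of how the corrected Lua script plugs back into the Scrapyjs request:

script = """
function main(splash)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:go(splash.args.url)
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end
"""

# same request as in the question, only the Lua script changes
yield scrapy.Request(url, callback=self.parse_house_contents, meta={
    'splash': {
        'args': {'lua_source': script},
        'endpoint': 'execute',
    }
})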

marvin

You can avoid having to use Splash in the first place by making the appropriate GET request to fetch the phone number yourself. Working spider:

import json
import re

import scrapy

class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents)

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        # the listing URL ends in "...-ID<property_id>.html"; pull the ID out of it
        property_id = re.search(r"ID(\w+)\.", response.url).group(1)

        # the phone number is served by a separate AJAX endpoint, not the HTML page
        phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
        yield scrapy.Request(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        # the endpoint returns JSON like {"value": "<phone number>"}
        phone_number = json.loads(response.body)["value"]
        print(phone_number)
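
To sanity-check that AJAX endpoint outside of Scrapy, a quick standalone request is enough. This is only a rough sketch: it assumes the `requests` library is available, reuses the property ID from the example URL in the question, and the site may of course require extra headers or cookies:

import json

import requests  # third-party HTTP client, used only for this quick check

# "yTzAT" is the ID taken from the question's example listing URL (...-IDyTzAT.html)
property_id = "yTzAT"
response = requests.get("https://olx.pt/ajax/misc/contact/phone/%s/" % property_id)
# the endpoint answers with JSON such as {"value": "<phone number>"}
print(json.loads(response.text)["value"])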

If there are more things to extract from this "dynamic" website, see if Splash is really enough and, if not, look into browser automation and selenium.
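
If you do go the selenium route, the click that the Lua script simulates could look roughly like this. This is a sketch only: it assumes the same #contact_methods markup as in the question, and the driver setup and waits would need tuning for the real page:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html")
    # click the same span the Lua script targets, so the phone number gets rendered
    driver.find_elements(By.CSS_SELECTOR, "#contact_methods span")[1].click()
    time.sleep(1)  # crude wait for the AJAX call to finish; a WebDriverWait would be cleaner
    print(driver.page_source)  # should now contain the rendered phone number
finally:
    driver.quit()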

alecxe
  • I actually need this to work because I will be moving to more complex JS sites with date-picker calendars and the like – psychok7 Mar 03 '16 at 20:09
  • @psychok7 are you sure scrapyjs would be enough for your complex dynamic website? Maybe switching to `selenium` would make things go faster and simpler. – alecxe Mar 03 '16 at 20:13
  • I am trying it out... I have no idea if it's possible or not, but I will look into selenium as well, thanks – psychok7 Mar 03 '16 at 20:19
  • @psychok7 Okay, added a note about `selenium` to the answer. Sorry for not solving your Splash-specific problem, but I would personally solve that via `selenium`... well, maybe partially because I'm more familiar with it than Splash, but my impression is that Splash would not universally solve the "dynamicness" problem the way a real browser would... just a thought. – alecxe Mar 03 '16 at 21:48
  • I accepted your answer as I saw that selenium is much more mature, but I am running into some issues; maybe you could help me out? Here is my question: http://stackoverflow.com/questions/35799855/scrapy-selenium-datepicker – psychok7 Mar 04 '16 at 15:33