
I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I'm still not able to render the JavaScript triggered by a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf

I am still getting the page without the phone number rendered:

import scrapy


class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main(splash)
            splash:go(splash.args.url)
            splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
            splash:wait(0.5)
            return splash:html()
        end
        """
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents, meta={
                'splash': {
                    'args': {'lua_source': script},
                    'endpoint': 'execute',
                }
            })

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        # drop into the debugger to inspect the (still un-rendered) response
        import ipdb; ipdb.set_trace()

How can I get this to work?

psychok7

2 Answers


Add

splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")

to the Lua script (before the splash:go call, so jQuery is loaded together with the page) and it will work.

function main(splash)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:go(splash.args.url)
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end

.click() is a jQuery function: https://api.jquery.com/click/
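
For reference, here is a minimal sketch (nothing assumed beyond the question's own spider and selectors) of how the corrected Lua script plugs back into the Scrapyjs request:

script = """
function main(splash)
    splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
    splash:go(splash.args.url)
    splash:runjs('document.getElementById("contact_methods").getElementsByTagName("span")[1].click();')
    splash:wait(0.5)
    return splash:html()
end
"""

# same request as in the question, only the Lua script changes
yield scrapy.Request(url, callback=self.parse_house_contents, meta={
    'splash': {
        'args': {'lua_source': script},
        'endpoint': 'execute',
    }
})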

marvin

You can avoid having to use Splash in the first place by making the appropriate GET request to fetch the phone number yourself. Working spider:

import json
import re

import scrapy

class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        for href in response.css('.link.linkWithHash.detailsLink::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_house_contents)

        for next_page in response.css('.pager .br3.brc8::attr(href)'):
            url = response.urljoin(next_page.extract())
            yield scrapy.Request(url, self.parse)

    def parse_house_contents(self, response):
        # the listing URL ends in "...-ID<property_id>.html"; pull the ID out of it
        property_id = re.search(r"ID(\w+)\.", response.url).group(1)

        # the phone number is served by a separate AJAX endpoint, not the HTML page
        phone_url = "https://olx.pt/ajax/misc/contact/phone/%s/" % property_id
        yield scrapy.Request(phone_url, callback=self.parse_phone)

    def parse_phone(self, response):
        # the endpoint returns JSON like {"value": "<phone number>"}
        phone_number = json.loads(response.body)["value"]
        print(phone_number)
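
To sanity-check that AJAX endpoint outside of Scrapy, a quick standalone request is enough. This is only a rough sketch: it assumes the `requests` library is available, reuses the property ID from the example URL in the question, and the site may of course require extra headers or cookies:

import json

import requests  # third-party HTTP client, used only for this quick check

# "yTzAT" is the ID taken from the question's example listing URL (...-IDyTzAT.html)
property_id = "yTzAT"
response = requests.get("https://olx.pt/ajax/misc/contact/phone/%s/" % property_id)
# the endpoint answers with JSON such as {"value": "<phone number>"}
print(json.loads(response.text)["value"])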

If there are more things to extract from this "dynamic" website, see if Splash is really enough and, if not, look into browser automation and selenium.
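
If you do go the selenium route, the click that the Lua script simulates could look roughly like this. This is a sketch only: it assumes the same #contact_methods markup as in the question, and the driver setup and waits would need tuning for the real page:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html")
    # click the same span the Lua script targets, so the phone number gets rendered
    driver.find_elements(By.CSS_SELECTOR, "#contact_methods span")[1].click()
    time.sleep(1)  # crude wait for the AJAX call to finish; a WebDriverWait would be cleaner
    print(driver.page_source)  # should now contain the rendered phone number
finally:
    driver.quit()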

alecxe
  • I actually need this to work because I will be moving to more complex JS sites with date-picker calendars and the like – psychok7 Mar 03 '16 at 20:09
  • @psychok7 are you sure scrapyjs would be enough for your complex dynamic website? Maybe switching to `selenium` would make things go faster and simpler. – alecxe Mar 03 '16 at 20:13
  • I am trying it out... I have no idea if it's possible or not, but I will look into selenium as well, thanks – psychok7 Mar 03 '16 at 20:19
  • @psychok7 Okay, added a note about `selenium` to the answer. Sorry for not solving your Splash-specific problem, but I would personally solve that via `selenium`... well, maybe partially because I'm more familiar with it than Splash, but my impression is that Splash would not universally solve the "dynamicness" problem the way a real browser would... just a thought. – alecxe Mar 03 '16 at 21:48
  • I accepted your answer as I saw that selenium is much more mature, but I am running into some issues; maybe you could help me out? Here is my question: http://stackoverflow.com/questions/35799855/scrapy-selenium-datepicker – psychok7 Mar 04 '16 at 15:33