
I'm trying to use Scrapy together with Selenium so that I can interact with JavaScript and still have the powerful scraping framework that Scrapy offers. I've written a script that visits http://www.iens.nl, enters "Amsterdam" in the search bar and then clicks the search button successfully. After clicking the search button, I want Scrapy to retrieve an element from the newly rendered page. Unfortunately, Scrapy doesn't return any values.

This is what my code looks like:

from selenium import webdriver
from scrapy.loader import ItemLoader
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from properties import PropertiesItem
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = ['http://www.iens.nl']

    def __init__(self):
        chrome_path = '/Users/username/Documents/chromedriver'
        self.driver = webdriver.Chrome(chrome_path)

    def parse(self, response):
        self.driver.get(response.url)
        text_box = self.driver.find_element_by_xpath('//*[@id="searchText"]')
        submit_button = self.driver.find_element_by_xpath('//*[@id="button_search"]')
        text_box.send_keys("Amsterdam")
        submit_button.click()

        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('description', '//*[@id="results"]/ul/li[1]/div[2]/h3/a/')

        return l.load_item()


process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()

"properties" is another script that looks like this:

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    description = Field()

Q: How do I successfully make Scrapy find the element I call "description" by its XPath on the page Selenium reached, and return it as output?

Thanks in advance!

titusAdam
  • @eLRuLL it did reach `parse`, otherwise selenium wouldn't have moved to the next page right? – titusAdam Jan 10 '17 at 14:50
  • You may want to have a look at this, for other ways to couple Scrapy with Selenium: http://stackoverflow.com/a/36085533/1204332 – Ivan Chaer Jan 10 '17 at 15:02

1 Answer


The response object you are passing to your ItemLoader is the Scrapy response, not the page Selenium ended up on after the click.

I would recommend creating a new Selector from the page source returned by Selenium:

from scrapy import Selector
...

selenium_response_text = self.driver.page_source

new_selector = Selector(text=selenium_response_text)
l = ItemLoader(item=PropertiesItem(), selector=new_selector)
...

That way, add_xpath extracts its data from the page source Selenium rendered instead of from the original Scrapy response (which you don't actually need here).

eLRuLL
  • Like I said; I want to use scrapy to scrape the data because of its speed! I know how to use selenium. :) – titusAdam Jan 10 '17 at 15:07
  • @titusAdam speed is not really a thing in selenium. If you want speed you either need to ditch selenium completely or replace it with something that supports asynchronous rendering. i.e. [Splash](http://splash.readthedocs.io/en/stable/) – Granitosaurus Jan 10 '17 at 15:09
  • @Granitosaurus is it possible to move through pages with splash like selenium does in this example? – titusAdam Jan 10 '17 at 15:11
  • @titusAdam it's possible to do it with scrapy __alone__, without any javascript rendering. – Granitosaurus Jan 10 '17 at 15:12
  • @Granitosaurus how? Scrapy doesn't support javascript right? – titusAdam Jan 10 '17 at 15:13
  • @titusAdam no, scrapy doesn't render javascript code. However you don't need any javascript rendering to scrape data from iens.nl, though it would require quite a bit more effort, see related: http://stackoverflow.com/questions/8550114/ – Granitosaurus Jan 10 '17 at 15:17
  • @Granitosaurus I know, but I want to learn how to use Scrapy with websites that need javascript to be rendered. – titusAdam Jan 10 '17 at 15:21
  • @titusAdam I just answered the question you posted: *How do I successfully make scrapy find the element I call "description" by its xpath on the page selenium reached and return it as output?* – eLRuLL Jan 10 '17 at 15:21
  • sorry for derailing @eLRuLL :D – Granitosaurus Jan 10 '17 at 15:23
  • @Granitosaurus Is there a way to click buttons/enter text/log-in on websites that use javascript and then scrape data from the newly rendered page? – titusAdam Jan 10 '17 at 15:29
  • @titusAdam yes, you can scrape pretty much every website without rendering any javascript, but it will be significantly harder since you need to reverse engineer the whole process the website goes through to serve you data. It takes more effort but in general is significantly faster and less resource intensive. Sometimes it's not worth the bother, and using Splash can give you very good results; you can do pretty much everything in Splash that you could do in Selenium. – Granitosaurus Jan 10 '17 at 15:33