
I've created a script in Python using Scrapy in combination with Selenium to parse the links of different restaurants from the site's main page and then parse the name of each restaurant from its inner page.

How do callbacks work (or how can I pass arguments between methods) without sending duplicate requests when I use Scrapy in combination with Selenium?

The following script works, but it effectively overrides each callback by navigating again with self.driver.get(response.url), which I can't get rid of:

import scrapy
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    link = 'https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        yield scrapy.Request(self.link, callback=self.parse)

    def parse(self, response):
        # Scrapy has already downloaded response.url; this fetches it a second time
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
            yield scrapy.Request(elem.get_attribute("href"), callback=self.parse_info)

    def parse_info(self, response):
        # Same double fetch here: one request by Scrapy, one navigation by Selenium
        self.driver.get(response.url)
        elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
        yield {"title": elem}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(YPageSpider)
    c.start()
– MITHU

2 Answers


Do you mean passing variables from function to function? Why not use `meta` for that? It works either way, with Selenium or without. I use the same code as you, with just two small updates:

def parse(self, response):
    self.driver.get(response.url)
    for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
        yield scrapy.Request(elem.get_attribute("href"),
                             callback=self.parse_info,
                             meta={'test': 'test'})  # <- pass anything here

def parse_info(self, response):
    self.driver.get(response.url)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
    yield {"title": elem, 'data': response.meta['test']}  # <- getting it here

So, it outputs:

...
2019-05-16 17:40:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/san-francisco-ca/mip/pizza-hut-473437740?lid=473437740>
{'data': 'test', 'title': u'Pizza Hut'}
...
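
As a side note, newer Scrapy versions (1.7+) can also pass values as plain keyword arguments to the callback via cb_kwargs instead of going through response.meta. A minimal sketch of the same two methods; the 'source' key is just an example name, not from the original code:

def parse(self, response):
    self.driver.get(response.url)
    for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
        yield scrapy.Request(elem.get_attribute("href"),
                             callback=self.parse_info,
                             cb_kwargs={'source': response.url})  # <- becomes a keyword argument

def parse_info(self, response, source):  # <- arrives as a plain parameter
    self.driver.get(response.url)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
    yield {"title": elem, "source": source}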
– vezunchik
  • Right you are @vezunchik, I can pass variables between methods like that. However, the problem is that my script sends a request using `yield scrapy.Request()` and then navigates to the same URL again using `self.driver.get(response.url)`, doing two different things for a single purpose. Isn't there any way for the script to pass links from one method to another and navigate with Selenium alone? As it stands, the request sent by `yield scrapy.Request()` is useless since it has no effect anyway. – MITHU May 16 '19 at 16:00
  • Ah, got it! Very interesting. I've found this promising thread about your problem https://stackoverflow.com/questions/31174330/passing-selenium-response-url-to-scrapy/31186730#31186730 but will only be able to try it later. Maybe it will help you in the meantime. – vezunchik May 16 '19 at 16:24
  • Here is an interesting solution to the two-request problem, more overhead than the link from @vezunchik but worth reading: https://stackoverflow.com/questions/50714354/scrapy-selenium-requests-twice-for-each-url/50715420#50715420 – pwinz May 17 '19 at 14:47

The linked answer that @vezunchik has already pointed out almost gets you there. The only problem is that if you use that exact code, you end up with multiple instances of chromedriver. To reuse the same driver throughout the crawl, you can try the approach below.

Create a file named middleware.py within your Scrapy project and paste the code below into it:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        # One headless Chrome instance shared across the whole crawl
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        self.driver = webdriver.Chrome(options=chromeOptions)

    def process_request(self, request, spider):
        # Navigate with Selenium and return the rendered HTML; returning a
        # response here makes Scrapy skip its own download of the URL
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
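
If you run the spider with the scrapy crawl command rather than through CrawlerProcess (used at the bottom of this answer), you would enable the middleware in your project's settings.py instead. A minimal sketch, assuming middleware.py sits inside a package called yourspider, as in the final script below:

# settings.py -- adjust the dotted path to your actual package layout
DOWNLOADER_MIDDLEWARES = {
    'yourspider.middleware.SeleniumMiddleware': 200,
}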

Thought I'd add an update in case you want to watch chromedriver traverse the pages in visible (non-headless) mode. To let the browser run visibly, and to quit it when the spider closes, try this instead:

from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed through the crawler's signal manager
        # (scrapy.xlib.pydispatch is deprecated and removed in Scrapy 2.0)
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_closed(self):
        self.driver.quit()
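
Note that both middleware variants grab page_source immediately after driver.get(), dropping the explicit waits the original spider had, so JavaScript-rendered content may not be present yet. Here is a sketch of the same middleware with a wait added; the class name, locator, and timeout are assumptions to tune for the target page:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class SeleniumWaitMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        try:
            # Wait up to 10s for some content to appear; swap the locator
            # for whatever element your spider actually scrapes
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "body")))
        except TimeoutException:
            pass  # hand back whatever has rendered so far
        return HtmlResponse(self.driver.current_url, body=self.driver.page_source,
                            encoding='utf-8', request=request)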

Use the following script to get the required content. With the middleware in place there is only a single navigation per URL, performed by Selenium, because returning a response from process_request makes Scrapy skip its own download. You can then use Selector() within your spider to extract the data, as shown below.

import sys
# The hardcoded path points at your project location so that the
# middleware referenced in CrawlerProcess below can be imported
sys.path.append(r'C:\Users\WCS\Desktop\yourproject')
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ['https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA']

    def parse(self, response):
        items = Selector(response)
        for elem in items.css(".v-card .info a.business-name::attr(href)").getall():
            yield scrapy.Request(url=response.urljoin(elem), callback=self.parse_info)

    def parse_info(self, response):
        items = Selector(response)
        title = items.css(".sales-info > h1::text").get()
        yield {"title": title}

if __name__ == '__main__':
    c = CrawlerProcess({
            'DOWNLOADER_MIDDLEWARES': {'yourspider.middleware.SeleniumMiddleware': 200},
        })
    c.crawl(YPageSpider)
    c.start()
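
In a regular project layout you could skip the sys.path tweak altogether: keep the spider in the project's spiders/ folder, move the DOWNLOADER_MIDDLEWARES entry into settings.py as sketched above, and launch it with scrapy crawl yellowpages.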
– SIM