
I have a Scrapy project that ran on Python 2.7 and I am now moving it to Python 3.6, but I have run into a 'problem'. Whenever I use a Scrapy Selector on the page_source I need to scrape from the driver, the whole driver.page_source gets printed in my terminal, which makes debugging difficult and, more importantly, makes the spider run much slower.

For example, this spider prints the whole page_source when run normally with 'scrapy crawl myspider':

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

class MySpider(Spider):
    name = 'myspider'
    allowed_domains = [domain]
    start_urls = [url]

    def parse(self, response):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--window-size=1600,1200')

        self.driver = webdriver.Chrome('/usr/local/bin/chromedriver',
                                       chrome_options=chrome_options)

        self.driver.get(url2)
        sleep(3)
        sel = Selector(text=self.driver.page_source)

I know how to stop the spider from printing the page_source in the debug output, but it still runs slowly. What I have done is the following:

import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)

But the spider keeps running much slower than it did with Python 2.7. Does anyone know why?
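
For reference, the raw page_source dump usually comes from DEBUG-level logging of the HTTP traffic between the Selenium client and chromedriver, which goes through urllib3 in addition to selenium's own remote connection logger. A minimal sketch that silences both (assuming that is where the output is coming from):

import logging

# Quiet both loggers that can echo the driver's HTTP traffic (including
# the full page_source) at DEBUG level.
logging.getLogger('selenium.webdriver.remote.remote_connection').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)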

Jorge Garcia
  • Am I wrong, or are you creating a new instance of Chrome every time you parse a page? Create it somewhere outside this function and use only driver.get(url2) inside. – Hellohowdododo Dec 28 '18 at 10:31
  • Also, why sleep(3)? Use expected conditions to wait for the exact moment your element shows up. – Hellohowdododo Dec 28 '18 at 10:32
  • Also, try this: https://stackoverflow.com/questions/46744968/how-to-suppress-console-error-warning-info-messages-when-executing-selenium-pyth – Hellohowdododo Dec 28 '18 at 10:34
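
Putting the comments together, here is a minimal sketch of the suggested structure, in the same Selenium 3 style as the question: the driver is created once per spider, sleep(3) is replaced with an explicit wait, and the '#content' CSS selector is only a placeholder for whatever element the page actually needs.

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class MySpider(Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        # Create the driver once per spider instead of once per parsed page.
        self.driver = webdriver.Chrome('/usr/local/bin/chromedriver',
                                       chrome_options=chrome_options)

    def parse(self, response):
        self.driver.get(url2)  # url2 kept as the placeholder from the question
        # Wait for a specific element instead of a fixed sleep(3).
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#content'))
        )
        sel = Selector(text=self.driver.page_source)

    def closed(self, reason):
        # Shut the browser down when the spider finishes.
        self.driver.quit()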

0 Answers