I had a Scrapy project running on Python 2.7 and I am now moving it to Python 3.6, but I have run into a problem. Whenever I use a Scrapy Selector on the page source I get from the Selenium driver, the whole driver.page_source is printed to my terminal, which makes debugging difficult and, more importantly, makes the spider run much slower.
For example, in this case the whole page_source gets printed when the spider is run normally with 'scrapy crawl myspider':
from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

class MySpider(Spider):
    name = 'myspider'
    allowed_domains = [domain]
    start_urls = [url]

    def parse(self, response):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--window-size=1600,1200')
        self.driver = webdriver.Chrome('/usr/local/bin/chromedriver', chrome_options=chrome_options)
        self.driver.get(url2)
        sleep(3)
        sel = Selector(text=self.driver.page_source)
I know how to stop the spider from printing the page_source in the debug output, but the spider still runs slowly. What I have done is the following:
import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)
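For reference, here is a slightly broader variant of the same idea, assuming Selenium 3 on Python 3, where the verbose wire-protocol output is emitted through the urllib3 logger as well as Selenium's own remote-connection logger, so both are raised to WARNING:

```python
import logging

# Silence the chatty loggers that dump the full request/response bodies.
# 'selenium.webdriver.remote.remote_connection' is Selenium's wire-protocol
# logger; 'urllib3' is the HTTP layer Selenium 3 uses under Python 3.
for name in ('selenium.webdriver.remote.remote_connection', 'urllib3'):
    logging.getLogger(name).setLevel(logging.WARNING)
```

This only suppresses the log output, though; it does not by itself change how fast the spider runs.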
But the spider is still much slower than it was on Python 2.7. Does anyone know why?