
I'm interested in getting a better idea of what Scrapy can do. Here is a very simple Selenium script that interacts with a website, fills in some boxes, clicks some elements, and downloads a file. Could this code be replicated using Scrapy, so that a Scrapy script does exactly the same thing?

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)

driver.get("https://www.ons.gov.uk/")
# type the search term once the search box is clickable
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
# find_element(By.XPATH, ...) is the Selenium 4 API; find_element_by_xpath was removed in Selenium 4
driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()

1 Answer


Yes, the Selenium code can be recreated using Scrapy with SeleniumRequest (from the scrapy-selenium package), which is faster than plain Selenium because it runs inside Scrapy's request cycle. You need a Scrapy project. It runs in headless mode, but the spider below saves a screenshot at each step.

script:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.ons.gov.uk',
            callback=self.parse,
            wait_time=3,
            screenshot=True,
        )

    def parse(self, response):
        # the SeleniumMiddleware exposes the underlying Selenium driver
        driver = response.meta['driver']
        driver.save_screenshot('screenshot.png')

        # type the search term once the search box is clickable
        WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.NAME, "q"))
        ).send_keys("Education and childcare")
        driver.save_screenshot('screenshot_1.png')

        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        driver.save_screenshot('screenshot_2.png')
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()
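Note that clicking the final link in a headless run does not reliably save the file through Scrapy. One alternative (a sketch, not part of the original answer; the XPath reuse and filename are assumptions) is to read the link's href from the driver, yield it back to Scrapy as an ordinary request, and write the response body in a callback. The file-writing part is plain Python:

```python
from pathlib import Path


def save_pdf(body: bytes, filename: str = 'report.pdf') -> Path:
    """Write the raw bytes of a downloaded response to disk."""
    path = Path(filename)
    path.write_bytes(body)
    return path

# Inside the spider, the same logic would run in a Scrapy callback, e.g.:
#
#     link = driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a')
#     yield scrapy.Request(link.get_attribute('href'), callback=self.parse_pdf)
#
#     def parse_pdf(self, response):
#         save_pdf(response.body, response.url.split('/')[-1])
```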


settings.py file:

You have to add the following options to your settings.py file:

# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}


# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
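If you have never created a Scrapy project, settings.py does not exist yet; it is generated when the project is created. A typical setup (the project name `testproject` is an assumption) looks like:

```shell
# generate a project; settings.py appears at testproject/testproject/settings.py
scrapy startproject testproject
cd testproject

# SeleniumRequest and SeleniumMiddleware come from the scrapy-selenium package
pip install scrapy-selenium

# run the spider declared with name = 'test'
# (save the spider file under testproject/testproject/spiders/)
scrapy crawl test
```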


Output:

'downloader/response_status_count/200'

For downloading the PDF itself, see: How to download pdf using scrapy

Md. Fazlul Hoque
  • Thank you very much for this. I'm new to Python so I may be a little slower to grasp things. Thankfully, I've understood the first half of the script. I'm not too sure what the settings.py file is. I'd kindly like to ask where it is located and how I can update it. Ultimately, I'm planning to use the power of Scrapy to try and run two different Selenium scripts at the same time. I hope it's possible. I posted a question here regarding it: https://stackoverflow.com/questions/71976269/how-can-i-make-this-selenium-code-run-in-parallel-using-scrapy?noredirect=1#comment127184203_71976269 –  Apr 23 '22 at 22:40
  • @David Copperfield Creating a Scrapy project and becoming familiar with the project files is important but can seem complex. Don't worry about it; you will definitely understand everything, but you have to spend more time on it. You can read and learn about the project settings portion from [here](https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/). Please read a bit more online and practice, and you will succeed. Thanks – Md. Fazlul Hoque Apr 23 '22 at 22:58
  • Just go to your project folder and you will find the settings.py file. Open it, and at the bottom (or anywhere in your settings.py file) copy and paste the middleware and Selenium portions. – Md. Fazlul Hoque Apr 23 '22 at 23:06
  • Thanks for the encouragement. I am using Spyder to run my script which I got after installing anaconda. I looked inside my 'anaconda3' folder. I have five 'settings.py' files. All in different locations. "C:\Users\David\anaconda3\Lib\site-packages\scrapy\commands\settings.py" "C:\Users\David\anaconda3\pkgs\bokeh-2.3.2-py38haa95532_0\Lib\site-packages\bokeh\settings.py" "C:\Users\David\anaconda3\Lib\site-packages\bokeh\settings.py" "C:\Users\David\anaconda3\pkgs\isort-5.8.0-pyhd3eb1b0_0\site-packages\isort\settings.py" "C:\Users\David\anaconda3\Lib\site-packages\isort\settings.py" –  Apr 23 '22 at 23:54
  • I'm not too sure which one to update. –  Apr 23 '22 at 23:56
  • Just curious: when you ran the code, did it download the PDF document? –  Apr 24 '22 at 18:32
  • @David Copperfield Thanks. No, Scrapy didn't download the PDF file, because Scrapy uses its media pipelines to download images and PDFs, and that's another question; but as far as your question goes, the code is working. I've added a link above. – Md. Fazlul Hoque Apr 24 '22 at 19:10
  • Regarding the settings.py and PDF question. I posted the question here: https://stackoverflow.com/questions/71990861/where-do-i-save-the-settings-py-file-to-for-this-scrapy-selenium-code-and-also-h –  Apr 24 '22 at 20:15