
I'd like the code below (developed by F. Hoque) to download a PDF file from this website.

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver    
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC    
       
class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.ons.gov.uk',
            callback=self.parse,
            wait_time = 3,
            screenshot = True
        )

    def parse(self, response):
        driver = response.meta['driver']
        driver.save_screenshot('screenshot.png')

        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
        driver.save_screenshot('screenshot_1.png')
        # find_element(By.XPATH, ...) replaces the find_element_by_* helpers, which were deprecated in Selenium 4 and later removed
        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        driver.save_screenshot('screenshot_2.png')
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()

Also, I'm not sure which settings.py file to add this to (as it is needed for the code to run):

# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}


# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

I am using Spyder via Anaconda 3 and I have five different settings.py files. Here are their respective locations:

"C:\Users\David\anaconda3\Lib\site-packages\scrapy\commands\settings.py" 
"C:\Users\David\anaconda3\pkgs\bokeh-2.3.2-py38haa95532_0\Lib\site-packages\bokeh\settings.py" 
"C:\Users\David\anaconda3\Lib\site-packages\bokeh\settings.py" 
"C:\Users\David\anaconda3\pkgs\isort-5.8.0-pyhd3eb1b0_0\site-packages\isort\settings.py" 
"C:\Users\David\anaconda3\Lib\site-packages\isort\settings.py" 

Which of these settings.py files should I add the second code block to?
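A note on the settings question: none of the five files listed is a Scrapy project settings file. The first is the implementation of the `scrapy settings` command, and the others belong to bokeh and isort. Project settings normally go in the settings.py that `scrapy startproject` generates inside the project folder. When the spider is run as a single script from Spyder, one alternative is the spider's `custom_settings` class attribute. The sketch below only builds that dictionary; `selenium_settings` is a hypothetical helper name, and the values are copied from the snippet above.

```python
from shutil import which


def selenium_settings(driver_name="chrome"):
    """Return the scrapy-selenium settings from the question as a dict.

    Hypothetical helper: the name and the idea of wrapping the settings
    in a function are assumptions made for illustration.
    """
    return {
        "DOWNLOADER_MIDDLEWARES": {"scrapy_selenium.SeleniumMiddleware": 800},
        "SELENIUM_DRIVER_NAME": driver_name,
        # which() returns None when chromedriver is not on PATH
        "SELENIUM_DRIVER_EXECUTABLE_PATH": which("chromedriver"),
        "SELENIUM_DRIVER_ARGUMENTS": ["--headless"],
    }


# Possible use inside the spider, without any settings.py file:
#
# class TestSpider(scrapy.Spider):
#     name = 'test'
#     custom_settings = selenium_settings()
```

If `SELENIUM_DRIVER_EXECUTABLE_PATH` comes back as `None`, chromedriver is probably not on the PATH that Spyder sees, and the full path to the executable would have to be given instead.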

  • Which part of your code did you expect to download a file? – mkrieger1 Apr 24 '22 at 17:51
  • And what does the downloading part have to do with the settings part? I have the impression that these are two totally unrelated questions (except that they both have to do with Selenium). – mkrieger1 Apr 24 '22 at 17:55
  • You're taking a screenshot with Selenium, what exactly do you want to download with scrapy? – grumpyp Apr 24 '22 at 18:19
  • I'd like to execute the selenium code within scrapy in order to have it run faster. The selenium code downloads a file from a website. I'd like this scrapy-selenium code to do the same. –  Apr 24 '22 at 18:30
  • Check the path for your project, because it's possible that the screenshots are being saved there. Or perhaps they are being saved to your "screenshots" folder. I would specify a path in the "driver.save_screenshot()" call, so that I know where the files are supposed to be saved. – Chowlett2 Apr 24 '22 at 18:44
  • When I run the file, this pops up on the console: runfile('C:/Users/David/Desktop/Selenium/untitled1.py', wdir='C:/Users/David/Desktop/Selenium'). I checked the folder and there is nothing in there. I'm completely new to scrapy. Is it supposed to save a screenshot? I was expecting it to download the PDF file, or am I completely off the mark here? –  Apr 24 '22 at 19:01
  • I have added more detail to the question, I hope it makes more sense now. –  Apr 24 '22 at 20:12

1 Answer


Scrapy can download PDF files and images using its media pipelines (files/images). As the output below shows, the spider returns only the link to the PDF, not the file itself. Note that the URL ends in /pdf rather than in a .pdf extension. If it ended in .pdf it would point directly to a file, and only then could I download it with Scrapy's media pipeline. If you open the URL from the output in a browser, the download starts manually. I don't know whether the /pdf endpoint can be converted into a .pdf URL so that it can be downloaded.

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC




class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.ons.gov.uk',
            callback=self.parse,
            wait_time = 3,
            screenshot = True
        )

    def parse(self, response):
        driver = response.meta['driver']
        #driver.save_screenshot('screenshot.png')

        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
        #driver.save_screenshot('screenshot_1.png')
        # find_element(By.XPATH, ...) replaces the find_element_by_* helpers, which were deprecated in Selenium 4 and later removed
        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        #driver.save_screenshot('screenshot_2.png')
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        # No need to click the PDF link: clicking it does not download the file
        #driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()
        #driver.save_screenshot('screenshot_pdf.png')

        # Read the PDF link from the page instead of clicking it
        pdf_url = driver.find_element(By.XPATH, '//*[@class="link-complex js-pdf-dl-link"]').get_attribute('href')
        
        yield {'url': pdf_url}
       

Output:

{'url': 'https://www.ons.gov.uk/peoplepopulationandcommunity/educationandchildcare/articles/remoteschoolingthroughthecoronaviruscovid19pandemicengland/april2020tojune2021/pdf'}
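As a possible follow-up to the output above, here is a minimal sketch of saving the linked file outside Scrapy with the standard library. `pdf_filename` and `download_pdf` are hypothetical helper names, the User-Agent header is an assumption, and it is untested whether the site returns the PDF bytes for a plain request to the /pdf endpoint.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def pdf_filename(url: str) -> str:
    """Build a local file name from a URL such as .../april2020tojune2021/pdf."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Drop a trailing "pdf" segment so the name comes from the article slug.
    if parts and parts[-1].lower() == "pdf":
        parts = parts[:-1]
    stem = parts[-1] if parts else "download"
    return stem if stem.lower().endswith(".pdf") else stem + ".pdf"


def download_pdf(url: str, folder: str = ".") -> Path:
    """Fetch the URL and write the response body to disk (hypothetical helper)."""
    # Assumption: a browser-like User-Agent is sent in case the server
    # rejects the default urllib one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    target = Path(folder) / pdf_filename(url)
    with urlopen(request, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_pdf(pdf_url)` with the URL from the output would write the response into the current working directory. Checking that the saved file starts with the bytes `%PDF` is a quick way to confirm that a PDF, and not an HTML page, came back.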

       
Md. Fazlul Hoque
  • Thank you for the clarification. So I guess scrapy will not be suitable for my needs then. I have a website that I need to log into using a username and password and then query and download multiple reports in xls/pdf format. I wanted to know if I could combine multiple selenium scripts into one and have them run at the same time. I thought that scrapy would be good for this. I posted a question here regarding how to merge selenium scripts and run them at the same time. I haven't found a solution yet, unfortunately. –  Apr 24 '22 at 22:22
  • https://stackoverflow.com/questions/71976269/how-can-i-make-this-selenium-code-run-in-parallel –  Apr 24 '22 at 22:23