
I am working on a stock-related project where I have to scrape all the data on a daily basis for the last 5 years, i.e. from 2016 to date. I chose Selenium in particular because I can drive the browser to filter the data by date with a button click, and now I want the same data that is displayed in the Selenium browser to be fed to Scrapy. The website I am working on is https://merolagani.com/Floorsheet.aspx. I have written the following code inside my Scrapy spider.

import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., daily, up to '01/10/2022'

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))    
            for data in range(z, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.url = driver.page_source
                yield Request(url=self.url, callback=self.parse)

               
    def parse(self, response, **kwargs):
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)

Update: the error after applying the answer is:

2022-01-14 14:11:36 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable

I want to feed the current page's HTML content to Scrapy, but I have been getting this unusual error for the past 2 days. Any help or suggestions will be very much appreciated.

1 Answer

The two solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.

Solution 1 - build an HtmlResponse from the driver's page source and scrape it right away (you can also pass it as an argument to a function):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., daily, up to '01/10/2022'

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))  # total number of result pages, parsed from the records label
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                    print("ok"*200)

        # return an empty requests list
        return []

Solution 2 - with a super simple downloader middleware:

(You might see a delay before the parse method runs, so be patient.)

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }
    driver = webdriver.Chrome()

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())


        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., daily, up to '01/10/2022'

        for date in floorsheet_dates:
            self.driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))  # total number of result pages, parsed from the records label
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)

Notice that I've used Chrome, so change it back to Firefox as in your original code.
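
For example, a minimal sketch of the Firefox setup, assuming Selenium 4 (where `executable_path` is replaced by a `Service` object) together with webdriver-manager:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

# let webdriver-manager fetch geckodriver and hand its path to Firefox
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))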

  • thanks for the answer, I will definitely try this solution and reply to you – lord stock Jan 13 '22 at 09:53
  • what is my middleware path if my project name is first_scrapy? – lord stock Jan 13 '22 at 10:03
  • If it's inside the spider (like in the answer) and the file's name is spider.py (in the answer the filename is yetanotherspider.py), then it will be: `first_scrapy.spiders.spider.SeleniumMiddleware`. But it's best if you put the middleware class inside middlewares.py, and then it will be `first_scrapy.middlewares.SeleniumMiddleware`; I only put it in the spider so you could see it better. See the settings sketch after these comments. – SuperUser Jan 13 '22 at 10:50
  • I am not getting any data printed :( – lord stock Jan 13 '22 at 17:05
  • It worked for me, `parse` method printed the data. Do you get any errors or something? – SuperUser Jan 13 '22 at 17:11
  • without the middleware it seems to be working, but with the middleware the parse method's `print('test ok')` is not printed – lord stock Jan 13 '22 at 17:13
  • just want to know, does the use of the middleware make scraping faster? – lord stock Jan 13 '22 at 17:16
  • In this case no, it's just so you can use `scrapy.Request` with the page from selenium. – SuperUser Jan 13 '22 at 17:22
  • thank you, you massively supported me; hope I can fix that middleware part too. – lord stock Jan 13 '22 at 17:30
  • hello, I am getting an error when the bot goes to the last page. Basically I was trying to save all the data in an array and convert it to JSON by appending, but I am getting "'NoneType' object is not iterable" – lord stock Jan 14 '22 at 05:46
  • Is it working for every other page? (If you can add the updated code it would be great) – SuperUser Jan 14 '22 at 06:18
  • Yeah, it works for all the other pages, but when it goes to the last page it gives the NoneType error. All my code is in this gist, just look at line number 88, I think it is throwing the error from there; you can ignore my CSV-related part: `https://gist.github.com/nawarazpokhrel/5626eb9998dba7951bad5e2a739036e8` – lord stock Jan 14 '22 at 06:23
  • any update regarding the issue? – lord stock Jan 14 '22 at 07:34
  • You never update `final_floor_sheet`, so it stays empty. – SuperUser Jan 14 '22 at 08:09
  • I removed that, still the same error :( – lord stock Jan 14 '22 at 08:27
  • I have updated the error traceback – lord stock Jan 14 '22 at 08:27
  • start_requests: `This method must return an iterable with the first Requests to crawl for this spider`. Make a dummy request at the end of the function and create a parse method with `pass`, and it should be OK. (If you use the middleware then you don't need to do this.) – SuperUser Jan 14 '22 at 08:53
  • can you please update the same in the answer, it will be helpful to others too – lord stock Jan 14 '22 at 08:57
  • @nava I checked and it's enough to return an empty list. I've updated the first solution. – SuperUser Jan 14 '22 at 09:30
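
To illustrate the middleware placement discussed in the comments, here is a minimal sketch of the relevant setting, assuming the project is named first_scrapy and the `SeleniumMiddleware` class from Solution 2 has been moved into first_scrapy/middlewares.py:

# settings.py of the hypothetical first_scrapy project
# (or the spider's custom_settings dict)
DOWNLOADER_MIDDLEWARES = {
    # dotted path to the SeleniumMiddleware class from Solution 2
    'first_scrapy.middlewares.SeleniumMiddleware': 543,
}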