
I'm having trouble with my spider; the way I have set it up doesn't seem to work. The spider should be able to scrape multiple pages (1, 2, 3), all on the same website, and I'm not sure whether I should use a for loop or an if/else statement to extract all the data. All I get in the log when I run it is: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min).

Any help would be greatly appreciated!

Shown below is the code for the spider, items.py, and pipelines.py:

import scrapy

from ..items import folder1Item  # the item class defined in items.py below


class abcSpider(scrapy.Spider):
    name = 'abc'
    page_number = 2
    allowed_domains = ['']

    def parse(self, response):
        items = folder1Item()

        deal_number_var = response.css(".mclbEl a::text").extract()
        deal_type_var = response.css('#ContentContainer1_ctl00_Content_ListCtrl1_LB1_VDTBL .mclbEl:nth-child(9)').css('::text').extract()

        items['deal_number_var'] = deal_number_var
        items['deal_type_var'] = deal_type_var
        yield items

        next_page = '' + str(abcSpider.page_number) + '/'
        if abcSpider.page_number < 8:
            abcSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

This is my items.py page:

import scrapy

class folder1Item(scrapy.Item):
    deal_number_var = scrapy.Field()
    deal_type_var = scrapy.Field()

I would like to save the data to a .db file so I can open it with sqlite3. My pipelines.py looks like this:

import sqlite3

class folder1Pipeline(object):

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect("abc.db")
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS abc_tb""")
        self.curr.execute("""create table abc_tb(deal_number_var text, deal_type_var text)""")

    def process_item(self, items, spider):
        self.store_db(items)
        return items

    def store_db(self, items):
        # two columns in the table, so two placeholders in the insert
        self.curr.execute("""insert into abc_tb values (?,?)""",
                          (items['deal_number_var'][0], items['deal_type_var'][0]))
        self.conn.commit()

Middleware.py code:

from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        driver.get(request.url)

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)

1 Answer


I assume this is your entire code? If so: you did not define any start_urls. Furthermore, you either have to set allowed_domains correctly or remove the variable completely, because right now you are declaring that no URL is allowed.
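For example, a minimal sketch of what the top of the spider could look like with both values set (example.com and the start URL are placeholders, since the real site is not shown in the question):

class abcSpider(scrapy.Spider):
    name = 'abc'
    page_number = 2
    # placeholders - replace with the real domain and the URL of the first page
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/1/']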

carpa_jo
  • First of all thx for reply! To get access to the page I want to scrape I have to log in, that's why I used Selenium to navigate through all those steps. And now I'm trying to connect it with scrapy. I have updated my code with my middleware.py script... I'm not entirely sure how to redirect the processed HTML link to my spider, and then to set up the 'items' part correctly. Do you have any idea how to solve that? – Simon Smith Apr 23 '20 at 11:03
  • Personally I would just try to use Scrapy's built-in methods for handling logins (FormRequest.from_response), see [here](https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login) for the official documentation or [here](https://python.gotrained.com/scrapy-formrequest-logging-in/) for a complete example. So set the `start_urls` to the login page, use the FormRequest.from_response method to handle the login and then start scraping (a rough sketch of this follows after these comments). And again, make sure to set the allowed_domains correctly or remove the variable or it won't work. – carpa_jo Apr 23 '20 at 11:40
  • Yeah, I thought about changing it all to Scrapy, it's just that when I gain access to the page through the login page, I have to click through a lot of drop-down menus and add some variables, which "driver.find_element_xpath" helped me with.... Do you have any tips regarding how to connect the two scraping tools? Again thx for your reply! – Simon Smith Apr 23 '20 at 15:13
  • What you are describing actually sounds quite doable with Scrapy alone. Except if there are a lot of AJAX calls and JavaScript involved, then Selenium might be useful. I have no personal experience with combining Scrapy and Selenium, but if you prefer combining both tools, have a look at [this](https://stackoverflow.com/a/17979285/9003106) and [this](https://stackoverflow.com/a/55880806/9003106). You might also want to check out [scrapy-splash](https://github.com/scrapy-plugins/scrapy-splash) and this [scrapy middleware](https://github.com/clemfromspace/scrapy-selenium). – carpa_jo Apr 23 '20 at 16:50
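In case it is useful, here is a rough sketch of the FormRequest.from_response login flow suggested in the comments above; the domain, URLs, and form field names are placeholders and would have to match the actual login page:

import scrapy
from scrapy import FormRequest


class abcSpider(scrapy.Spider):
    name = 'abc'
    # placeholders - use the real domain and login page of the site
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # submit the login form found on the start page; the field names are guesses
        return FormRequest.from_response(
            response,
            formdata={'username': 'your_username', 'password': 'your_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # logged in - follow to the first data page and parse it as before
        yield response.follow('/1/', callback=self.parse_page)

    def parse_page(self, response):
        # the extraction logic from the original parse() would go here
        pass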