
I have to crawl a lot of sites with a single spider. Is there a way to do that?

The code I tried raises an error in the callback, and I can't fix it. Is there a way to make this code work, or to map the callbacks in a list/dict format like this?

Thank you.

import scrapy
from ..items import AppItem
urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
    # more links to add
    # ...
}

item = AppItem()


class Bot(scrapy.Spider):
    name = 'app'

    def start_requests(self):
        for cb in urls:
            yield scrapy.Request(url=urls[cb], callback=cb)
            

    def fun1(self, response):
        item['title'] = response.css('title')

        yield item

    def fun2(self, response):
        item['title'] = response.css('title')

        yield item

Error:

C:/Python310/python.exe c:/zCode/News/newsScraper/startApp.py
2021-11-26 03:09:06 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request    
    request = next(slot.start_requests)
  File "c:\zCode\News\newsScraper\newsScraper\spiders\app.py", line 18, in start_requests    
    yield scrapy.Request(url=urls[cb], callback=cb)
  File "C:\Python310\lib\site-packages\scrapy\http\request\__init__.py", line 32, in __init__
    raise TypeError(f'callback must be a callable, got {type(callback).__name__}')
TypeError: callback must be a callable, got str
bootoo

2 Answers


As the exception says, you are passing a string as the callback, while a callable is expected.

That means doing this instead would work:

    def start_requests(self):
        for cb in urls:
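            # self.fun1 is a callable (a bound method), so the TypeError
            # goes away; note that every request is now handled by fun1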
            yield scrapy.Request(url=urls[cb], callback=self.fun1)

Since you are providing your URLs and the callbacks you want to use in code anyway, I'd suggest that you just drop the urls dict and yield your requests directly, without the loop:

    def start_requests(self):
        yield scrapy.Request(url='http://example1.com', callback=self.fun1)
        yield scrapy.Request(url='http://example2.com', callback=self.fun2)
        ...

That would at least be the easiest solution. If you insist on referring to the methods by their string names, you'll probably need to work with getattr. For that purpose, check this SO question.
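A rough sketch of what that could look like with the urls dict from the question, assuming every key in the dict matches the name of a method on the spider:

    def start_requests(self):
        for cb in urls:
            # getattr resolves the string key (e.g. 'fun1') into the bound
            # method self.fun1; a missing method raises AttributeError
            yield scrapy.Request(url=urls[cb], callback=getattr(self, cb))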

Patrick Klein

There's an even easier solution using item loaders that takes the keys from the dict into account: each key is passed along with its request via cb_kwargs, so you don't have to write a separate callback function for each URL.

from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst
import scrapy


class BotItem(scrapy.Item):
    objects = Field(output_processor=TakeFirst())
    fun = Field(output_processor=TakeFirst())


class Bot(scrapy.Spider):
    name = 'app'

    start_urls = {
        'fun1': 'http://example1.com',
        'fun2': 'https://example2.com',
        # more links to add ...
    }

    def start_requests(self):
        for key, url in self.start_urls.items():
            # forward the dict key to the callback via cb_kwargs
            yield scrapy.Request(
                url,
                callback=self.fun1,
                cb_kwargs={'key': key},
            )

    def fun1(self, response, key):
        # one callback for every URL; the key records where the item came from
        for stuff in response.xpath('//div[@class="container"]'):
            loader = ItemLoader(BotItem(), selector=stuff)
            loader.add_value('fun', key)
            loader.add_xpath('objects', '//div[@class="some_objects"]//text()')
            yield loader.load_item()
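cb_kwargs (available since Scrapy 1.7) simply forwards extra keyword arguments to the callback, which is what lets a single callback handle every URL while still knowing which dict key the request came from.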

joe_bill.dollar